Background: Identifying the key predictors of childhood Body Mass Index (BMI) is crucial for informing effective health interventions, given the significant influence of diet, chemical exposures, and metabolomic profiles on health outcomes.
Objective: This study aims to predict childhood BMI Z-scores by integrating comprehensive data on postnatal diet, chemical exposures, and serum metabolomics, using advanced statistical models.
Methods: Data from the HELIX study, comprising 1301 mother-child pairs with children aged 6-11 years, were analyzed. The data were split into training (70%) and test (30%) sets. Several modeling approaches were applied, including LASSO, ridge, elastic net, decision tree, random forest, and gradient boosting machine (GBM), with 10-fold cross-validation used to evaluate model performance and ensure robustness. Models were compared on their Root Mean Squared Error (RMSE) and on the predictors they identified.
Results: Models incorporating the comprehensive set of variables (diet, chemicals, metabolomics, and covariates) consistently outperformed those with fewer variables, and the inclusion of metabolomic data substantially improved performance. The GBM model with all variables achieved a cross-validated RMSE of 0.9537, the group LASSO model an RMSE of 0.8748, and the random forest model an RMSE of 1.0047. Key predictors identified across the models included age and sex; specific metabolites such as metab_161 (coefficient 3.5751), metab_95 (1.3657), and metab_160 (-3.2241); and chemical exposures such as polychlorinated biphenyl 170 (hs_pcb170_cadj_Log2).
Conclusion: Integrating demographic, dietary, chemical, and metabolomic data provides robust predictions of BMI in children. Advanced machine learning techniques, particularly GBM and random forest, substantially enhance predictive performance. These findings underscore the value of drawing on diverse data types to understand and address childhood obesity effectively.
Understanding the impact of postnatal environmental exposures on childhood health outcomes is critical for anticipating long-term health implications, and clarifying the multifactorial influences on childhood BMI is essential for developing effective interventions and policies. This study examines the impact of postnatal environmental exposures, dietary habits, and covariates on the BMI Z-score of children aged 6-11 years.
Research indicates that postnatal exposure to endocrine-disrupting chemicals (EDCs) such as phthalates, bisphenol A (BPA), and polychlorinated biphenyls (PCBs) can significantly influence body weight and metabolic health (Junge et al., 2018). These chemicals, commonly found in household products and absorbed through dietary intake, are linked to detrimental effects on metabolic health in children. Because EDCs interfere with hormonal regulation, this interference can lead to an increased body mass index (BMI) in children, suggesting a potential pathway through which exposure to these chemicals contributes to the development of obesity.
Harley et al. (2013) investigated the association between prenatal and postnatal bisphenol A (BPA) exposure and body composition metrics in 9-year-old children from the CHAMACOS cohort. Higher prenatal BPA exposure was linked to lower BMI and body fat percentage in girls but not boys, suggesting sex-specific effects. Conversely, BPA levels measured at age 9 were positively associated with increased adiposity in both sexes, highlighting how the timing of exposure shapes its impact on childhood development.
A 2016 study by Caspersen et al. investigated the determinants of plasma levels of polychlorinated biphenyls (PCBs), brominated flame retardants, and organochlorine pesticides in pregnant women and 3-year-old children. In pregnant women, significant predictors of increased plasma concentrations of persistent organic pollutants (POPs) included age, low parity, and low pre-pregnancy BMI. Among the 3-year-olds, prolonged breastfeeding duration was a major determinant of increased POP concentrations, and diet was a significant predictor of PCB levels. By examining environmental exposures and their influence on BMI, this project aims to extend the understanding of how specific dietary patterns and chemical exposures during early childhood contribute to BMI variations, aligning with the finding that diet significantly predicts POP levels in children.
These studies collectively illustrate the critical role of environmental exposures in shaping metabolic health outcomes in children, highlighting the necessity for ongoing research and policy intervention to mitigate these risks.
How are postnatal environmental exposures, specifically those found in household products and dietary intake, along with specific serum metabolomics profiles, associated with the BMI Z-score of children aged 6-11 years? Specifically, do higher concentrations of certain serum metabolites, indicative of exposure to chemical classes or metals, correlate with variations in BMI Z-score when controlling for age and other relevant covariates? Additionally, can these metabolites serve as biomarkers for the risk of developing obesity in children?
This study utilizes data from a subcohort of 1301 mother-child pairs in the HELIX study, with children aged 6-11 years, for whom complete exposure and outcome data were available. Exposure data included detailed postnatal dietary records and concentrations of various chemicals measured in child blood samples. The variables are both categorical and numerical, spanning demographic details and biochemical measurements. This dataset supports robust statistical analysis of potential associations between chemical exposures and BMI Z-scores while accounting for confounding factors such as age, sex, and socioeconomic status. Because the core dataset contains no missing values, no imputation was needed. Child BMI Z-scores were calculated based on WHO growth standards.
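WHO growth standards derive BMI Z-scores with the LMS method, which standardizes a measurement against age- and sex-specific skewness (L), median (M), and coefficient of variation (S) parameters. A minimal sketch of that formula follows; the L, M, and S values in the example are hypothetical illustrations, not actual WHO reference values.

```r
# Minimal sketch of the LMS z-score formula underlying WHO growth
# standards: z = ((X/M)^L - 1) / (L*S) when L != 0, else log(X/M)/S.
lms_zscore <- function(x, L, M, S) {
  if (L != 0) {
    ((x / M)^L - 1) / (L * S)
  } else {
    log(x / M) / S
  }
}

# Example with illustrative (non-WHO) reference values:
# a BMI at the reference median M yields z = 0.
lms_zscore(16.0, L = -1.2, M = 16.0, S = 0.11)
```

In practice the `L`, `M`, and `S` values are looked up from the published WHO reference tables for the child's exact age and sex before applying the formula.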
When the metabolomic data were merged with the rest of the dataset, 103 rows (about 8% of the data) were excluded. These entries had no metabolomic information at all and therefore could not be imputed with the mean or median. While dropping rows carries some risk, retaining rows with missing values can produce inaccurate or biased results, and many statistical methods and machine learning algorithms require complete data. Because the proportion of missing data (8%) is relatively small, excluding these rows is a reasonable approach: it minimizes the loss of information while preserving a dataset large enough for meaningful analysis.
load("/Users/allison/Library/CloudStorage/GoogleDrive-aflouie@usc.edu/My Drive/HELIX_data/HELIX.RData")
filtered_chem_diet <- codebook %>%
filter(domain %in% c("Chemicals", "Lifestyles") & period == "Postnatal" & subfamily != "Allergens")
# specific covariates
filtered_covariates <- codebook %>%
filter(domain == "Covariates" &
variable_name %in% c("ID", "e3_sex_None", "e3_yearbir_None", "hs_child_age_None"))
#specific phenotype variables
filtered_phenotype <- codebook %>%
filter(domain == "Phenotype" &
variable_name %in% c("hs_zbmi_who"))
# combining all necessary variables together
combined_codebook <- bind_rows(filtered_chem_diet, filtered_covariates, filtered_phenotype)
load("/Users/allison/Library/CloudStorage/GoogleDrive-aflouie@usc.edu/My Drive/HELIX_data/metabol_serum.RData")
metabol_serum_transposed <- as.data.frame(t(metabol_serum.d))
metabol_serum_transposed$ID <- as.integer(rownames(metabol_serum_transposed))
# add the ID column to the first position
metabol_serum_transposed <- metabol_serum_transposed[, c("ID", setdiff(names(metabol_serum_transposed), "ID"))]
# ID is the first column, and the layout is preserved
kable(head(metabol_serum_transposed), align = "c", digits = 2, format = "pipe")
(Preview of the first six rows of the transposed serum metabolomics data: an ID column followed by metab_1 through metab_177. The full 178-column table is omitted here for readability; the duplicated ID column in the raw kable output is an artifact of the row names.)
outcome_BMI <- phenotype %>%
dplyr::select(hs_zbmi_who)
outcome_and_cov <- cbind(covariates, outcome_BMI)
outcome_and_cov <- outcome_and_cov[, !duplicated(colnames(outcome_and_cov))]
outcome_and_cov <- outcome_and_cov %>%
dplyr::select(ID, hs_child_age_None, e3_sex_None, e3_yearbir_None, hs_zbmi_who)
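The `!duplicated(colnames(...))` step above guards against the duplicate columns that `cbind` creates when two data frames share a column name. A toy illustration (not HELIX data) of that behavior:

```r
# Toy illustration: cbind-ing data frames that share a column name
# yields duplicate columns; keeping only the first occurrence of
# each name mirrors the de-duplication step used in the analysis.
a <- data.frame(ID = 1:3, x = 4:6)
b <- data.frame(ID = 1:3, y = 7:9)

ab <- cbind(a, b)                       # columns: ID, x, ID, y
ab <- ab[, !duplicated(colnames(ab))]   # columns: ID, x, y
```

`duplicated()` marks the second and later occurrences of each name, so the first copy of every column is the one retained.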
#the full chemicals list
chemicals_specific <- c(
"hs_cd_c_Log2",
"hs_co_c_Log2",
"hs_cs_c_Log2",
"hs_cu_c_Log2",
"hs_hg_c_Log2",
"hs_mo_c_Log2",
"hs_pb_c_Log2",
"hs_dde_cadj_Log2",
"hs_pcb153_cadj_Log2",
"hs_pcb170_cadj_Log2",
"hs_dep_cadj_Log2",
"hs_pbde153_cadj_Log2",
"hs_pfhxs_c_Log2",
"hs_pfoa_c_Log2",
"hs_pfos_c_Log2",
"hs_prpa_cadj_Log2",
"hs_mbzp_cadj_Log2",
"hs_mibp_cadj_Log2",
"hs_mnbp_cadj_Log2"
)
#postnatal diet for child
postnatal_diet <- c(
"h_bfdur_Ter",
"hs_bakery_prod_Ter",
"hs_dairy_Ter",
"hs_fastfood_Ter",
"hs_org_food_Ter",
"hs_readymade_Ter",
"hs_total_bread_Ter",
"hs_total_fish_Ter",
"hs_total_fruits_Ter",
"hs_total_lipids_Ter",
"hs_total_potatoes_Ter",
"hs_total_sweets_Ter",
"hs_total_veg_Ter"
)
covariates_selected <- c("hs_child_age_None", "e3_sex_None", "e3_yearbir_None")
all_columns <- c(chemicals_specific, postnatal_diet)
extracted_exposome <- exposome %>% dplyr::select(all_of(all_columns))
selected_id_data <- cbind(outcome_and_cov, extracted_exposome)
# ID is the common identifier in both datasets
combined_data <- merge(selected_id_data, metabol_serum_transposed, by = "ID", all = TRUE)
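Because `all = TRUE` performs a full outer join, IDs present in only one of the two tables are kept with `NA` in the other table's columns; this is exactly where the later `na.omit()` step comes in. A toy sketch (not HELIX data) of the behavior:

```r
# Toy illustration of merge(..., all = TRUE): IDs found in only one
# table are retained with NA for the other table's columns.
left  <- data.frame(ID = c(1, 2, 3), zbmi    = c(0.4, -0.1, 1.2))
right <- data.frame(ID = c(2, 3, 4), metab_1 = c(5.1, 4.8, 6.0))

full_result  <- merge(left, right, by = "ID", all = TRUE)  # 4 rows; NAs for IDs 1 and 4
inner_result <- merge(left, right, by = "ID")              # 2 rows; complete cases only
```

Using the default inner join instead of `all = TRUE` followed by `na.omit()` would give the same complete-case result in this simple case, but the outer join makes the extent of the mismatch visible before rows are dropped.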
selected_metabolomics_data <- combined_data %>% dplyr::select(-c(ID))
# the number of NAs in each column
missing_summary <- sapply(selected_metabolomics_data, function(x) sum(is.na(x)))
missing_summary_df <- data.frame(
Variable = names(missing_summary),
Missing_Count = missing_summary,
Missing_Percentage = (missing_summary / nrow(selected_metabolomics_data)) * 100
)
missing_summary_df <- missing_summary_df %>% filter(Missing_Count > 0)
missing_summary_df
selected_metabolomics_data <- selected_metabolomics_data %>% na.omit() # drops the 103 rows (~8% of the data) that are missing across the metab_ columns
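The Methods describe splitting the data 70/30 into training and test sets before cross-validation. A minimal base-R sketch of that split on a toy data frame follows (the real analysis would use `selected_metabolomics_data` in place of `dat`; the seed is illustrative, not the study's):

```r
# Sketch of a 70/30 train/test split, using base R on toy data.
set.seed(123)
dat <- data.frame(hs_zbmi_who = rnorm(100), x1 = rnorm(100))

n <- nrow(dat)
train_idx <- sample(seq_len(n), size = floor(0.70 * n))
train_set <- dat[train_idx, ]   # 70 rows
test_set  <- dat[-train_idx, ]  # 30 rows

# 10-fold cross-validation would then be configured on train_set,
# e.g. with caret::trainControl(method = "cv", number = 10).
```

Splitting before any model tuning keeps the test set untouched, so the reported test RMSE reflects genuinely held-out data.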
These variables were categorized into tertiles to assess the impact of different levels of dietary intake on BMI Z-scores. The dietary intake variables included:
h_bfdur_Ter: Duration of breastfeeding
hs_bakery_prod_Ter: Intake of bakery products
hs_break_cer_Ter: Intake of breakfast cereals
hs_dairy_Ter: Intake of dairy products
hs_fastfood_Ter: Intake of fast food
hs_org_food_Ter: Intake of organic food
hs_proc_meat_Ter: Intake of processed meat
hs_total_fish_Ter: Intake of total fish
hs_total_fruits_Ter: Intake of total fruits
hs_total_lipids_Ter: Intake of total lipids
hs_total_sweets_Ter: Intake of total sweets
hs_total_veg_Ter: Intake of total vegetables
# specific lifestyle exposures
dietary_exposures <- c(
"h_bfdur_Ter",
"hs_bakery_prod_Ter",
"hs_break_cer_Ter",
"hs_dairy_Ter",
"hs_fastfood_Ter",
"hs_org_food_Ter",
"hs_proc_meat_Ter",
"hs_total_fish_Ter",
"hs_total_fruits_Ter",
"hs_total_lipids_Ter",
"hs_total_sweets_Ter",
"hs_total_veg_Ter"
)
dietary_exposome <- dplyr::select(exposome, all_of(dietary_exposures))
summarytools::view(dfSummary(dietary_exposome, style = 'grid', plain.ascii = FALSE, valid.col = FALSE, headings = FALSE), method = "render")
| No | Variable | Type | Missing |
|---|---|---|---|
| 1 | h_bfdur_Ter | factor | 0 (0.0%) |
| 2 | hs_bakery_prod_Ter | factor | 0 (0.0%) |
| 3 | hs_break_cer_Ter | factor | 0 (0.0%) |
| 4 | hs_dairy_Ter | factor | 0 (0.0%) |
| 5 | hs_fastfood_Ter | factor | 0 (0.0%) |
| 6 | hs_org_food_Ter | factor | 0 (0.0%) |
| 7 | hs_proc_meat_Ter | factor | 0 (0.0%) |
| 8 | hs_total_fish_Ter | factor | 0 (0.0%) |
| 9 | hs_total_fruits_Ter | factor | 0 (0.0%) |
| 10 | hs_total_lipids_Ter | factor | 0 (0.0%) |
| 11 | hs_total_sweets_Ter | factor | 0 (0.0%) |
| 12 | hs_total_veg_Ter | factor | 0 (0.0%) |

(The frequency counts and bar graphs from the rendered dfSummary output are not reproduced here.)
Generated by summarytools 1.0.1 (R version 4.4.0)
2024-08-05
This study included a comprehensive assessment of various chemical exposures to understand their impact on childhood BMI Z-scores. The specific chemical exposures selected for analysis were as follows:
hs_cd_c_Log2: Cadmium concentration
hs_co_c_Log2: Cobalt concentration
hs_cs_c_Log2: Cesium concentration
hs_cu_c_Log2: Copper concentration
hs_hg_c_Log2: Mercury concentration
hs_mo_c_Log2: Molybdenum concentration
hs_pb_c_Log2: Lead concentration
hs_dde_cadj_Log2: DDE (a breakdown product of DDT) concentration, adjusted
hs_pcb153_cadj_Log2: PCB 153 concentration, adjusted
hs_pcb170_cadj_Log2: PCB 170 concentration, adjusted
hs_dep_cadj_Log2: DEP (diethyl phosphate) concentration, adjusted
hs_pbde153_cadj_Log2: PBDE 153 (polybrominated diphenyl ether) concentration, adjusted
hs_pfhxs_c_Log2: PFHxS (perfluorohexane sulfonic acid) concentration
hs_pfoa_c_Log2: PFOA (perfluorooctanoic acid) concentration
hs_pfos_c_Log2: PFOS (perfluorooctane sulfonic acid) concentration
hs_prpa_cadj_Log2: PRPA (propyl paraben) concentration, adjusted
hs_mbzp_cadj_Log2: MBzP (mono-benzyl phthalate) concentration, adjusted
hs_mibp_cadj_Log2: MiBP (mono-isobutyl phthalate) concentration, adjusted
hs_mnbp_cadj_Log2: MnBP (mono-n-butyl phthalate) concentration, adjusted
# specific chemical exposures
chemical_exposures <- c(
"hs_cd_c_Log2",
"hs_co_c_Log2",
"hs_cs_c_Log2",
"hs_cu_c_Log2",
"hs_hg_c_Log2",
"hs_mo_c_Log2",
"hs_pb_c_Log2",
"hs_dde_cadj_Log2",
"hs_pcb153_cadj_Log2",
"hs_pcb170_cadj_Log2",
"hs_dep_cadj_Log2",
"hs_pbde153_cadj_Log2",
"hs_pfhxs_c_Log2",
"hs_pfoa_c_Log2",
"hs_pfos_c_Log2",
"hs_prpa_cadj_Log2",
"hs_mbzp_cadj_Log2",
"hs_mibp_cadj_Log2",
"hs_mnbp_cadj_Log2"
)
chemical_exposome <- dplyr::select(exposome, all_of(chemical_exposures))
summarytools::view(dfSummary(chemical_exposome, style = 'grid', plain.ascii = FALSE, valid.col = FALSE, headings = FALSE), method = "render")
| No | Variable | Type | Distinct values | Missing |
|---|---|---|---|---|
| 1 | hs_cd_c_Log2 | numeric | 695 | 0 (0.0%) |
| 2 | hs_co_c_Log2 | numeric | 317 | 0 (0.0%) |
| 3 | hs_cs_c_Log2 | numeric | 369 | 0 (0.0%) |
| 4 | hs_cu_c_Log2 | numeric | 345 | 0 (0.0%) |
| 5 | hs_hg_c_Log2 | numeric | 698 | 0 (0.0%) |
| 6 | hs_mo_c_Log2 | numeric | 593 | 0 (0.0%) |
| 7 | hs_pb_c_Log2 | numeric | 529 | 0 (0.0%) |
| 8 | hs_dde_cadj_Log2 | numeric | 1050 | 0 (0.0%) |
| 9 | hs_pcb153_cadj_Log2 | numeric | 1047 | 0 (0.0%) |
| 10 | hs_pcb170_cadj_Log2 | numeric | 1039 | 0 (0.0%) |
| 11 | hs_dep_cadj_Log2 | numeric | 1045 | 0 (0.0%) |
| 12 | hs_pbde153_cadj_Log2 | numeric | 1036 | 0 (0.0%) |
| 13 | hs_pfhxs_c_Log2 | numeric | 1061 | 0 (0.0%) |
| 14 | hs_pfoa_c_Log2 | numeric | 1061 | 0 (0.0%) |
| 15 | hs_pfos_c_Log2 | numeric | 1050 | 0 (0.0%) |
| 16 | hs_prpa_cadj_Log2 | numeric | 1031 | 0 (0.0%) |
| 17 | hs_mbzp_cadj_Log2 | numeric | 1046 | 0 (0.0%) |
| 18 | hs_mibp_cadj_Log2 | numeric | 1057 | 0 (0.0%) |
| 19 | hs_mnbp_cadj_Log2 | numeric | 1048 | 0 (0.0%) |
Covariates were selected for their relevance to the child's postnatal development. The exceptions are sex and year of birth, which are fixed at pregnancy; they were included because BMI differs between sexes and varies with the child's age and birth year.
# Specified covariates
specific_covariates <- c(
"e3_sex_None",
"e3_yearbir_None",
"hs_child_age_None"
)
covariate_data <- dplyr::select(covariates, all_of(specific_covariates))
summarytools::view(dfSummary(covariate_data, style = 'grid', plain.ascii = FALSE, valid.col = FALSE, headings = FALSE), method = "render")
| No | Variable | Type | Distinct values | Missing |
|---|---|---|---|---|
| 1 | e3_sex_None | factor | — | 0 (0.0%) |
| 2 | e3_yearbir_None | factor | — | 0 (0.0%) |
| 3 | hs_child_age_None | numeric | 879 | 0 (0.0%) |
The primary outcome of interest in this analysis is the Body Mass Index (BMI) Z-score of children, a measure of relative weight adjusted for child age and sex. The BMI Z-score provides a standardized way to compare a child's BMI with a reference population.
outcome_BMI <- phenotype %>%
dplyr::select(hs_zbmi_who)
summarytools::view(dfSummary(outcome_BMI, style = 'grid', plain.ascii = FALSE, valid.col = FALSE, headings = FALSE), method = "render")
| No | Variable | Type | Distinct values | Missing |
|---|---|---|---|---|
| 1 | hs_zbmi_who | numeric | 421 | 0 (0.0%) |
combined_data <- cbind(covariate_data, dietary_exposome, chemical_exposome, outcome_BMI)
combined_data <- combined_data[, !duplicated(colnames(combined_data))]
# sex variable to a factor for stratification
combined_data$e3_sex_None <- as.factor(combined_data$e3_sex_None)
levels(combined_data$e3_sex_None) <- c("Male", "Female")
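Relabeling with `levels()<-` as above depends on the existing level order (here, that the first level codes males). A toy sketch of a more explicit alternative, `factor(..., levels, labels)`, which maps each code to its label regardless of order; the "1"/"2" codes are illustrative, not necessarily HELIX's coding:

```r
# Toy illustration: positional relabeling vs. explicit mapping.
sex_raw <- factor(c("2", "1", "1", "2"))

# Positional relabeling: silently assumes levels(sex_raw) is c("1", "2").
sex_pos <- sex_raw
levels(sex_pos) <- c("Male", "Female")

# Explicit mapping: correct regardless of how the levels are ordered.
sex_map <- factor(sex_raw, levels = c("1", "2"),
                  labels = c("Male", "Female"))
```

Both forms agree when the assumed order holds, but the explicit mapping fails loudly (producing `NA`s) rather than mislabeling if the coding differs from expectation.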
render_cont <- function(x) {
with(stats.default(x), sprintf("%0.2f (%0.2f)", MEAN, SD))
}
render_cat <- function(x) {
c("", sapply(stats.default(x), function(y) with(y, sprintf("%d (%0.1f %%)", FREQ, PCT))))
}
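The render functions above format continuous variables as "mean (SD)" and categorical variables as "count (%)" in the table1 output. A base-R illustration of the continuous `sprintf` pattern (in the actual pipeline, `MEAN` and `SD` are supplied by `stats.default` from the table1 package):

```r
# Base-R illustration of the "mean (SD)" cell formatting used by
# render_cont above; x is toy data, not a HELIX variable.
x <- c(7.5, 8.0, 8.5)
sprintf("%0.2f (%0.2f)", mean(x), sd(x))  # "8.00 (0.50)"
```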
table1_formula <- ~
hs_child_age_None + e3_yearbir_None +
hs_zbmi_who +
h_bfdur_Ter + hs_bakery_prod_Ter + hs_break_cer_Ter + hs_dairy_Ter + hs_fastfood_Ter + hs_org_food_Ter +
hs_proc_meat_Ter +
hs_total_fish_Ter + hs_total_fruits_Ter + hs_total_lipids_Ter + hs_total_sweets_Ter + hs_total_veg_Ter +
hs_cd_c_Log2 + hs_co_c_Log2 + hs_cs_c_Log2 + hs_cu_c_Log2 +
hs_hg_c_Log2 + hs_mo_c_Log2 + hs_dde_cadj_Log2 + hs_pcb153_cadj_Log2 +
hs_pcb170_cadj_Log2 + hs_dep_cadj_Log2 + hs_pbde153_cadj_Log2 +
hs_pfhxs_c_Log2 + hs_pfoa_c_Log2 + hs_pfos_c_Log2 + hs_prpa_cadj_Log2 +
hs_mbzp_cadj_Log2 + hs_mibp_cadj_Log2 + hs_mnbp_cadj_Log2 | e3_sex_None
table1(
table1_formula,
data = combined_data,
render.continuous = render_cont,
render.categorical = render_cat,
overall = TRUE,
topclass = "Rtable1-shade"
)
|  | Male (N=608) | Female (N=693) | Overall (N=1301) |
|---|---|---|---|
| hs_child_age_None | 7.91 (1.58) | 8.03 (1.64) | 7.98 (1.61) |
| e3_yearbir_None | |||
| 2003 | 25 (4.1 %) | 30 (4.3 %) | 55 (4.2 %) |
| 2004 | 46 (7.6 %) | 61 (8.8 %) | 107 (8.2 %) |
| 2005 | 121 (19.9 %) | 120 (17.3 %) | 241 (18.5 %) |
| 2006 | 108 (17.8 %) | 148 (21.4 %) | 256 (19.7 %) |
| 2007 | 128 (21.1 %) | 122 (17.6 %) | 250 (19.2 %) |
| 2008 | 177 (29.1 %) | 202 (29.1 %) | 379 (29.1 %) |
| 2009 | 3 (0.5 %) | 10 (1.4 %) | 13 (1.0 %) |
| hs_zbmi_who | 0.35 (1.15) | 0.45 (1.22) | 0.40 (1.19) |
| h_bfdur_Ter | |||
| (0,10.8] | 231 (38.0 %) | 275 (39.7 %) | 506 (38.9 %) |
| (10.8,34.9] | 118 (19.4 %) | 152 (21.9 %) | 270 (20.8 %) |
| (34.9,Inf] | 259 (42.6 %) | 266 (38.4 %) | 525 (40.4 %) |
| hs_bakery_prod_Ter | |||
| (0,2] | 164 (27.0 %) | 181 (26.1 %) | 345 (26.5 %) |
| (2,6] | 188 (30.9 %) | 235 (33.9 %) | 423 (32.5 %) |
| (6,Inf] | 256 (42.1 %) | 277 (40.0 %) | 533 (41.0 %) |
| hs_break_cer_Ter | |||
| (0,1.1] | 141 (23.2 %) | 150 (21.6 %) | 291 (22.4 %) |
| (1.1,5.5] | 251 (41.3 %) | 270 (39.0 %) | 521 (40.0 %) |
| (5.5,Inf] | 216 (35.5 %) | 273 (39.4 %) | 489 (37.6 %) |
| hs_dairy_Ter | |||
| (0,14.6] | 175 (28.8 %) | 184 (26.6 %) | 359 (27.6 %) |
| (14.6,25.6] | 229 (37.7 %) | 236 (34.1 %) | 465 (35.7 %) |
| (25.6,Inf] | 204 (33.6 %) | 273 (39.4 %) | 477 (36.7 %) |
| hs_fastfood_Ter | |||
| (0,0.132] | 75 (12.3 %) | 68 (9.8 %) | 143 (11.0 %) |
| (0.132,0.5] | 273 (44.9 %) | 330 (47.6 %) | 603 (46.3 %) |
| (0.5,Inf] | 260 (42.8 %) | 295 (42.6 %) | 555 (42.7 %) |
| hs_org_food_Ter | |||
| (0,0.132] | 211 (34.7 %) | 218 (31.5 %) | 429 (33.0 %) |
| (0.132,1] | 191 (31.4 %) | 205 (29.6 %) | 396 (30.4 %) |
| (1,Inf] | 206 (33.9 %) | 270 (39.0 %) | 476 (36.6 %) |
| hs_proc_meat_Ter | |||
| (0,1.5] | 175 (28.8 %) | 191 (27.6 %) | 366 (28.1 %) |
| (1.5,4] | 227 (37.3 %) | 244 (35.2 %) | 471 (36.2 %) |
| (4,Inf] | 206 (33.9 %) | 258 (37.2 %) | 464 (35.7 %) |
| hs_total_fish_Ter | |||
| (0,1.5] | 183 (30.1 %) | 206 (29.7 %) | 389 (29.9 %) |
| (1.5,3] | 224 (36.8 %) | 230 (33.2 %) | 454 (34.9 %) |
| (3,Inf] | 201 (33.1 %) | 257 (37.1 %) | 458 (35.2 %) |
| hs_total_fruits_Ter | |||
| (0,7] | 174 (28.6 %) | 239 (34.5 %) | 413 (31.7 %) |
| (7,14.1] | 216 (35.5 %) | 191 (27.6 %) | 407 (31.3 %) |
| (14.1,Inf] | 218 (35.9 %) | 263 (38.0 %) | 481 (37.0 %) |
| hs_total_lipids_Ter | |||
| (0,3] | 193 (31.7 %) | 204 (29.4 %) | 397 (30.5 %) |
| (3,7] | 171 (28.1 %) | 232 (33.5 %) | 403 (31.0 %) |
| (7,Inf] | 244 (40.1 %) | 257 (37.1 %) | 501 (38.5 %) |
| hs_total_sweets_Ter | |||
| (0,4.1] | 149 (24.5 %) | 195 (28.1 %) | 344 (26.4 %) |
| (4.1,8.5] | 251 (41.3 %) | 265 (38.2 %) | 516 (39.7 %) |
| (8.5,Inf] | 208 (34.2 %) | 233 (33.6 %) | 441 (33.9 %) |
| hs_total_veg_Ter | |||
| (0,6] | 190 (31.2 %) | 214 (30.9 %) | 404 (31.1 %) |
| (6,8.5] | 136 (22.4 %) | 178 (25.7 %) | 314 (24.1 %) |
| (8.5,Inf] | 282 (46.4 %) | 301 (43.4 %) | 583 (44.8 %) |
| hs_cd_c_Log2 | -3.99 (0.98) | -3.95 (1.09) | -3.97 (1.04) |
| hs_co_c_Log2 | -2.37 (0.61) | -2.32 (0.64) | -2.34 (0.63) |
| hs_cs_c_Log2 | 0.44 (0.58) | 0.44 (0.57) | 0.44 (0.57) |
| hs_cu_c_Log2 | 9.81 (0.25) | 9.84 (0.22) | 9.83 (0.23) |
| hs_hg_c_Log2 | -0.24 (1.59) | -0.35 (1.75) | -0.30 (1.68) |
| hs_mo_c_Log2 | -0.32 (0.83) | -0.31 (0.96) | -0.32 (0.90) |
| hs_dde_cadj_Log2 | 4.63 (1.48) | 4.70 (1.50) | 4.67 (1.49) |
| hs_pcb153_cadj_Log2 | 3.47 (0.86) | 3.63 (0.94) | 3.56 (0.90) |
| hs_pcb170_cadj_Log2 | -0.60 (3.22) | -0.05 (2.77) | -0.31 (3.00) |
| hs_dep_cadj_Log2 | 0.27 (3.16) | 0.06 (3.25) | 0.16 (3.21) |
| hs_pbde153_cadj_Log2 | -4.66 (3.86) | -4.40 (3.80) | -4.53 (3.83) |
| hs_pfhxs_c_Log2 | -1.62 (1.30) | -1.53 (1.31) | -1.57 (1.31) |
| hs_pfoa_c_Log2 | 0.60 (0.55) | 0.62 (0.56) | 0.61 (0.55) |
| hs_pfos_c_Log2 | 0.95 (1.15) | 0.99 (1.08) | 0.97 (1.11) |
| hs_prpa_cadj_Log2 | -1.26 (3.96) | -1.91 (3.68) | -1.61 (3.82) |
| hs_mbzp_cadj_Log2 | 2.42 (1.23) | 2.47 (1.22) | 2.44 (1.22) |
| hs_mibp_cadj_Log2 | 5.54 (1.09) | 5.39 (1.12) | 5.46 (1.11) |
| hs_mnbp_cadj_Log2 | 4.77 (1.08) | 4.60 (0.96) | 4.68 (1.02) |
combined_data$e3_yearbir_None <- as.factor(combined_data$e3_yearbir_None)
render_cont <- function(x) {
with(stats.default(x), sprintf("%0.2f (%0.2f)", MEAN, SD))
}
render_cat <- function(x) {
c("", sapply(stats.default(x), function(y) with(y, sprintf("%d (%0.1f %%)", FREQ, PCT))))
}
table1_formula_year <- ~
hs_child_age_None + e3_sex_None +
hs_zbmi_who +
h_bfdur_Ter + hs_bakery_prod_Ter + hs_break_cer_Ter + hs_dairy_Ter + hs_fastfood_Ter + hs_org_food_Ter +
hs_proc_meat_Ter +
hs_total_fish_Ter + hs_total_fruits_Ter + hs_total_lipids_Ter + hs_total_sweets_Ter + hs_total_veg_Ter +
hs_cd_c_Log2 + hs_co_c_Log2 + hs_cs_c_Log2 + hs_cu_c_Log2 +
hs_hg_c_Log2 + hs_mo_c_Log2 + hs_dde_cadj_Log2 + hs_pcb153_cadj_Log2 +
hs_pcb170_cadj_Log2 + hs_dep_cadj_Log2 + hs_pbde153_cadj_Log2 +
hs_pfhxs_c_Log2 + hs_pfoa_c_Log2 + hs_pfos_c_Log2 + hs_prpa_cadj_Log2 +
hs_mbzp_cadj_Log2 + hs_mibp_cadj_Log2 + hs_mnbp_cadj_Log2 | e3_yearbir_None
table1(
table1_formula_year,
data = combined_data,
render.continuous = render_cont,
render.categorical = render_cat,
overall = TRUE,
topclass = "Rtable1-shade"
)
|  | 2003 (N=55) | 2004 (N=107) | 2005 (N=241) | 2006 (N=256) | 2007 (N=250) | 2008 (N=379) | 2009 (N=13) | Overall (N=1301) |
|---|---|---|---|---|---|---|---|---|
| hs_child_age_None | 11.32 (0.47) | 10.81 (0.38) | 9.19 (0.57) | 8.38 (0.41) | 6.91 (0.51) | 6.41 (0.31) | 6.18 (0.15) | 7.98 (1.61) |
| e3_sex_None | ||||||||
| Male | 25 (45.5 %) | 46 (43.0 %) | 121 (50.2 %) | 108 (42.2 %) | 128 (51.2 %) | 177 (46.7 %) | 3 (23.1 %) | 608 (46.7 %) |
| Female | 30 (54.5 %) | 61 (57.0 %) | 120 (49.8 %) | 148 (57.8 %) | 122 (48.8 %) | 202 (53.3 %) | 10 (76.9 %) | 693 (53.3 %) |
| hs_zbmi_who | 0.32 (1.15) | 0.12 (1.17) | 0.45 (1.11) | 0.34 (1.11) | 0.41 (1.22) | 0.49 (1.28) | 0.79 (1.14) | 0.40 (1.19) |
| h_bfdur_Ter | ||||||||
| (0,10.8] | 35 (63.6 %) | 65 (60.7 %) | 92 (38.2 %) | 81 (31.6 %) | 90 (36.0 %) | 138 (36.4 %) | 5 (38.5 %) | 506 (38.9 %) |
| (10.8,34.9] | 15 (27.3 %) | 28 (26.2 %) | 69 (28.6 %) | 42 (16.4 %) | 38 (15.2 %) | 75 (19.8 %) | 3 (23.1 %) | 270 (20.8 %) |
| (34.9,Inf] | 5 (9.1 %) | 14 (13.1 %) | 80 (33.2 %) | 133 (52.0 %) | 122 (48.8 %) | 166 (43.8 %) | 5 (38.5 %) | 525 (40.4 %) |
| hs_bakery_prod_Ter | ||||||||
| (0,2] | 16 (29.1 %) | 21 (19.6 %) | 88 (36.5 %) | 117 (45.7 %) | 48 (19.2 %) | 53 (14.0 %) | 2 (15.4 %) | 345 (26.5 %) |
| (2,6] | 12 (21.8 %) | 30 (28.0 %) | 75 (31.1 %) | 90 (35.2 %) | 77 (30.8 %) | 132 (34.8 %) | 7 (53.8 %) | 423 (32.5 %) |
| (6,Inf] | 27 (49.1 %) | 56 (52.3 %) | 78 (32.4 %) | 49 (19.1 %) | 125 (50.0 %) | 194 (51.2 %) | 4 (30.8 %) | 533 (41.0 %) |
| hs_break_cer_Ter | ||||||||
| (0,1.1] | 16 (29.1 %) | 38 (35.5 %) | 60 (24.9 %) | 62 (24.2 %) | 39 (15.6 %) | 70 (18.5 %) | 6 (46.2 %) | 291 (22.4 %) |
| (1.1,5.5] | 19 (34.5 %) | 36 (33.6 %) | 103 (42.7 %) | 103 (40.2 %) | 103 (41.2 %) | 153 (40.4 %) | 4 (30.8 %) | 521 (40.0 %) |
| (5.5,Inf] | 20 (36.4 %) | 33 (30.8 %) | 78 (32.4 %) | 91 (35.5 %) | 108 (43.2 %) | 156 (41.2 %) | 3 (23.1 %) | 489 (37.6 %) |
| hs_dairy_Ter | ||||||||
| (0,14.6] | 15 (27.3 %) | 21 (19.6 %) | 67 (27.8 %) | 58 (22.7 %) | 73 (29.2 %) | 119 (31.4 %) | 6 (46.2 %) | 359 (27.6 %) |
| (14.6,25.6] | 12 (21.8 %) | 27 (25.2 %) | 90 (37.3 %) | 101 (39.5 %) | 80 (32.0 %) | 149 (39.3 %) | 6 (46.2 %) | 465 (35.7 %) |
| (25.6,Inf] | 28 (50.9 %) | 59 (55.1 %) | 84 (34.9 %) | 97 (37.9 %) | 97 (38.8 %) | 111 (29.3 %) | 1 (7.7 %) | 477 (36.7 %) |
| hs_fastfood_Ter | ||||||||
| (0,0.132] | 6 (10.9 %) | 14 (13.1 %) | 20 (8.3 %) | 21 (8.2 %) | 22 (8.8 %) | 57 (15.0 %) | 3 (23.1 %) | 143 (11.0 %) |
| (0.132,0.5] | 38 (69.1 %) | 45 (42.1 %) | 140 (58.1 %) | 153 (59.8 %) | 94 (37.6 %) | 129 (34.0 %) | 4 (30.8 %) | 603 (46.3 %) |
| (0.5,Inf] | 11 (20.0 %) | 48 (44.9 %) | 81 (33.6 %) | 82 (32.0 %) | 134 (53.6 %) | 193 (50.9 %) | 6 (46.2 %) | 555 (42.7 %) |
| hs_org_food_Ter | ||||||||
| (0,0.132] | 17 (30.9 %) | 30 (28.0 %) | 77 (32.0 %) | 51 (19.9 %) | 96 (38.4 %) | 155 (40.9 %) | 3 (23.1 %) | 429 (33.0 %) |
| (0.132,1] | 20 (36.4 %) | 39 (36.4 %) | 78 (32.4 %) | 99 (38.7 %) | 68 (27.2 %) | 87 (23.0 %) | 5 (38.5 %) | 396 (30.4 %) |
| (1,Inf] | 18 (32.7 %) | 38 (35.5 %) | 86 (35.7 %) | 106 (41.4 %) | 86 (34.4 %) | 137 (36.1 %) | 5 (38.5 %) | 476 (36.6 %) |
| hs_proc_meat_Ter | ||||||||
| (0,1.5] | 6 (10.9 %) | 26 (24.3 %) | 38 (15.8 %) | 37 (14.5 %) | 91 (36.4 %) | 162 (42.7 %) | 6 (46.2 %) | 366 (28.1 %) |
| (1.5,4] | 36 (65.5 %) | 41 (38.3 %) | 80 (33.2 %) | 92 (35.9 %) | 83 (33.2 %) | 134 (35.4 %) | 5 (38.5 %) | 471 (36.2 %) |
| (4,Inf] | 13 (23.6 %) | 40 (37.4 %) | 123 (51.0 %) | 127 (49.6 %) | 76 (30.4 %) | 83 (21.9 %) | 2 (15.4 %) | 464 (35.7 %) |
| hs_total_fish_Ter | ||||||||
| (0,1.5] | 11 (20.0 %) | 20 (18.7 %) | 32 (13.3 %) | 36 (14.1 %) | 106 (42.4 %) | 180 (47.5 %) | 4 (30.8 %) | 389 (29.9 %) |
| (1.5,3] | 31 (56.4 %) | 56 (52.3 %) | 80 (33.2 %) | 70 (27.3 %) | 82 (32.8 %) | 128 (33.8 %) | 7 (53.8 %) | 454 (34.9 %) |
| (3,Inf] | 13 (23.6 %) | 31 (29.0 %) | 129 (53.5 %) | 150 (58.6 %) | 62 (24.8 %) | 71 (18.7 %) | 2 (15.4 %) | 458 (35.2 %) |
| hs_total_fruits_Ter | ||||||||
| (0,7] | 33 (60.0 %) | 58 (54.2 %) | 73 (30.3 %) | 55 (21.5 %) | 68 (27.2 %) | 119 (31.4 %) | 7 (53.8 %) | 413 (31.7 %) |
| (7,14.1] | 13 (23.6 %) | 24 (22.4 %) | 88 (36.5 %) | 76 (29.7 %) | 81 (32.4 %) | 123 (32.5 %) | 2 (15.4 %) | 407 (31.3 %) |
| (14.1,Inf] | 9 (16.4 %) | 25 (23.4 %) | 80 (33.2 %) | 125 (48.8 %) | 101 (40.4 %) | 137 (36.1 %) | 4 (30.8 %) | 481 (37.0 %) |
| hs_total_lipids_Ter | ||||||||
| (0,3] | 9 (16.4 %) | 18 (16.8 %) | 97 (40.2 %) | 82 (32.0 %) | 67 (26.8 %) | 122 (32.2 %) | 2 (15.4 %) | 397 (30.5 %) |
| (3,7] | 26 (47.3 %) | 45 (42.1 %) | 71 (29.5 %) | 59 (23.0 %) | 80 (32.0 %) | 116 (30.6 %) | 6 (46.2 %) | 403 (31.0 %) |
| (7,Inf] | 20 (36.4 %) | 44 (41.1 %) | 73 (30.3 %) | 115 (44.9 %) | 103 (41.2 %) | 141 (37.2 %) | 5 (38.5 %) | 501 (38.5 %) |
| hs_total_sweets_Ter | ||||||||
| (0,4.1] | 16 (29.1 %) | 17 (15.9 %) | 87 (36.1 %) | 96 (37.5 %) | 51 (20.4 %) | 74 (19.5 %) | 3 (23.1 %) | 344 (26.4 %) |
| (4.1,8.5] | 13 (23.6 %) | 38 (35.5 %) | 99 (41.1 %) | 103 (40.2 %) | 104 (41.6 %) | 157 (41.4 %) | 2 (15.4 %) | 516 (39.7 %) |
| (8.5,Inf] | 26 (47.3 %) | 52 (48.6 %) | 55 (22.8 %) | 57 (22.3 %) | 95 (38.0 %) | 148 (39.1 %) | 8 (61.5 %) | 441 (33.9 %) |
| hs_total_veg_Ter | ||||||||
| (0,6] | 19 (34.5 %) | 26 (24.3 %) | 75 (31.1 %) | 63 (24.6 %) | 81 (32.4 %) | 136 (35.9 %) | 4 (30.8 %) | 404 (31.1 %) |
| (6,8.5] | 11 (20.0 %) | 25 (23.4 %) | 53 (22.0 %) | 71 (27.7 %) | 55 (22.0 %) | 95 (25.1 %) | 4 (30.8 %) | 314 (24.1 %) |
| (8.5,Inf] | 25 (45.5 %) | 56 (52.3 %) | 113 (46.9 %) | 122 (47.7 %) | 114 (45.6 %) | 148 (39.1 %) | 5 (38.5 %) | 583 (44.8 %) |
| hs_cd_c_Log2 | -4.32 (1.45) | -3.93 (1.01) | -3.90 (1.15) | -3.91 (1.00) | -3.90 (0.88) | -4.05 (0.97) | -4.09 (1.85) | -3.97 (1.04) |
| hs_co_c_Log2 | -2.35 (0.51) | -2.38 (0.59) | -2.54 (0.60) | -2.45 (0.68) | -2.27 (0.58) | -2.20 (0.62) | -2.11 (0.55) | -2.34 (0.63) |
| hs_cs_c_Log2 | 1.04 (0.47) | 1.03 (0.50) | 0.71 (0.43) | 0.65 (0.43) | 0.18 (0.50) | 0.07 (0.45) | -0.04 (0.40) | 0.44 (0.57) |
| hs_cu_c_Log2 | 9.83 (0.18) | 9.89 (0.30) | 9.79 (0.21) | 9.76 (0.22) | 9.82 (0.22) | 9.88 (0.23) | 9.86 (0.26) | 9.83 (0.23) |
| hs_hg_c_Log2 | 0.56 (1.21) | 0.73 (1.27) | 0.41 (1.35) | 0.12 (1.35) | -0.78 (1.73) | -1.11 (1.70) | -0.75 (1.41) | -0.30 (1.68) |
| hs_mo_c_Log2 | -0.85 (1.80) | -0.43 (0.64) | -0.45 (0.83) | -0.32 (0.81) | -0.26 (0.83) | -0.16 (0.88) | -0.27 (0.94) | -0.32 (0.90) |
| hs_dde_cadj_Log2 | 4.06 (1.34) | 3.90 (1.09) | 4.23 (1.24) | 4.32 (1.02) | 4.96 (1.66) | 5.28 (1.62) | 5.16 (1.23) | 4.67 (1.49) |
| hs_pcb153_cadj_Log2 | 3.54 (0.73) | 3.47 (0.72) | 3.80 (0.84) | 4.03 (0.79) | 3.32 (0.95) | 3.25 (0.87) | 3.69 (0.96) | 3.56 (0.90) |
| hs_pcb170_cadj_Log2 | 0.48 (1.19) | 0.17 (2.28) | 0.64 (2.35) | 1.08 (1.77) | -1.46 (3.58) | -1.32 (3.29) | -0.92 (3.65) | -0.31 (3.00) |
| hs_dep_cadj_Log2 | -0.62 (3.08) | 0.10 (3.10) | 0.13 (3.31) | 0.19 (2.90) | 0.65 (3.10) | -0.02 (3.42) | -0.25 (3.57) | 0.16 (3.21) |
| hs_pbde153_cadj_Log2 | -4.43 (3.35) | -5.26 (3.63) | -4.26 (3.77) | -3.68 (3.60) | -4.30 (3.95) | -5.20 (3.93) | -5.08 (3.90) | -4.53 (3.83) |
| hs_pfhxs_c_Log2 | -0.58 (0.66) | -0.53 (0.85) | -1.01 (0.99) | -1.07 (0.88) | -2.03 (1.40) | -2.37 (1.22) | -2.68 (0.71) | -1.57 (1.31) |
| hs_pfoa_c_Log2 | 0.53 (0.44) | 0.55 (0.53) | 0.67 (0.49) | 0.64 (0.51) | 0.66 (0.63) | 0.55 (0.58) | 0.35 (0.50) | 0.61 (0.55) |
| hs_pfos_c_Log2 | 1.67 (0.68) | 1.55 (0.72) | 1.18 (1.00) | 1.10 (1.14) | 0.80 (1.05) | 0.62 (1.17) | 0.40 (0.94) | 0.97 (1.11) |
| hs_prpa_cadj_Log2 | -3.21 (3.68) | -2.67 (3.04) | -0.92 (3.92) | -1.65 (3.89) | -1.57 (3.80) | -1.58 (3.80) | 0.87 (4.50) | -1.61 (3.82) |
| hs_mbzp_cadj_Log2 | 2.69 (1.08) | 2.84 (1.11) | 2.43 (1.19) | 2.32 (1.14) | 2.36 (1.34) | 2.45 (1.24) | 2.21 (1.51) | 2.44 (1.22) |
| hs_mibp_cadj_Log2 | 5.19 (1.08) | 5.59 (1.07) | 4.84 (0.97) | 4.82 (0.98) | 5.84 (1.00) | 6.02 (0.93) | 6.27 (0.97) | 5.46 (1.11) |
| hs_mnbp_cadj_Log2 | 3.99 (0.86) | 4.34 (0.88) | 4.26 (0.90) | 4.52 (0.90) | 4.85 (0.95) | 5.11 (1.05) | 5.22 (0.77) | 4.68 (1.02) |
For variable selection, elastic net was applied to the available diet and chemical variables in the HELIX data, and the selected variables were carried forward for further analysis.
outcome_cov <- cbind(covariate_data, outcome_BMI)
outcome_cov <- outcome_cov[, !duplicated(colnames(outcome_cov))]
#the full chemicals list
chemicals_full <- c(
"hs_as_c_Log2",
"hs_cd_c_Log2",
"hs_co_c_Log2",
"hs_cs_c_Log2",
"hs_cu_c_Log2",
"hs_hg_c_Log2",
"hs_mn_c_Log2",
"hs_mo_c_Log2",
"hs_pb_c_Log2",
"hs_tl_cdich_None",
"hs_dde_cadj_Log2",
"hs_ddt_cadj_Log2",
"hs_hcb_cadj_Log2",
"hs_pcb118_cadj_Log2",
"hs_pcb138_cadj_Log2",
"hs_pcb153_cadj_Log2",
"hs_pcb170_cadj_Log2",
"hs_pcb180_cadj_Log2",
"hs_dep_cadj_Log2",
"hs_detp_cadj_Log2",
"hs_dmdtp_cdich_None",
"hs_dmp_cadj_Log2",
"hs_dmtp_cadj_Log2",
"hs_pbde153_cadj_Log2",
"hs_pbde47_cadj_Log2",
"hs_pfhxs_c_Log2",
"hs_pfna_c_Log2",
"hs_pfoa_c_Log2",
"hs_pfos_c_Log2",
"hs_pfunda_c_Log2",
"hs_bpa_cadj_Log2",
"hs_bupa_cadj_Log2",
"hs_etpa_cadj_Log2",
"hs_mepa_cadj_Log2",
"hs_oxbe_cadj_Log2",
"hs_prpa_cadj_Log2",
"hs_trcs_cadj_Log2",
"hs_mbzp_cadj_Log2",
"hs_mecpp_cadj_Log2",
"hs_mehhp_cadj_Log2",
"hs_mehp_cadj_Log2",
"hs_meohp_cadj_Log2",
"hs_mep_cadj_Log2",
"hs_mibp_cadj_Log2",
"hs_mnbp_cadj_Log2",
"hs_ohminp_cadj_Log2",
"hs_oxominp_cadj_Log2",
"hs_cotinine_cdich_None",
"hs_globalexp2_None"
)
#postnatal diet for child
postnatal_diet <- c(
"h_bfdur_Ter",
"hs_bakery_prod_Ter",
"hs_beverages_Ter",
"hs_break_cer_Ter",
"hs_caff_drink_Ter",
"hs_dairy_Ter",
"hs_fastfood_Ter",
"hs_org_food_Ter",
"hs_proc_meat_Ter",
"hs_readymade_Ter",
"hs_total_bread_Ter",
"hs_total_cereal_Ter",
"hs_total_fish_Ter",
"hs_total_fruits_Ter",
"hs_total_lipids_Ter",
"hs_total_meat_Ter",
"hs_total_potatoes_Ter",
"hs_total_sweets_Ter",
"hs_total_veg_Ter",
"hs_total_yog_Ter"
)
chemicals_columns <- c(chemicals_full)
all_chemicals <- exposome %>% dplyr::select(all_of(chemicals_columns))
diet_columns <- c(postnatal_diet)
all_diet <- exposome %>% dplyr::select(all_of(diet_columns))
all_columns <- c(chemicals_full, postnatal_diet)
extracted_exposome <- exposome %>% dplyr::select(all_of(all_columns))
chemicals_outcome_cov <- cbind(outcome_cov, all_chemicals)
diet_outcome_cov <- cbind(outcome_cov, all_diet)
interested_data <- cbind(outcome_cov, extracted_exposome)
head(interested_data)
The chemical exposures are screened first, using elastic net to identify the most predictive variables.
# train/test 70-30
set.seed(101)
train_indices <- sample(seq_len(nrow(chemicals_outcome_cov)), size = floor(0.7 * nrow(chemicals_outcome_cov)))
test_indices <- setdiff(seq_len(nrow(chemicals_outcome_cov)), train_indices)
x_train <- as.matrix(chemicals_outcome_cov[train_indices, setdiff(names(chemicals_outcome_cov), "hs_zbmi_who")])
y_train <- chemicals_outcome_cov$hs_zbmi_who[train_indices]
x_test <- as.matrix(chemicals_outcome_cov[test_indices, setdiff(names(chemicals_outcome_cov), "hs_zbmi_who")])
y_test <- chemicals_outcome_cov$hs_zbmi_who[test_indices]
x_train_chemicals_only <- as.matrix(chemicals_outcome_cov[train_indices, chemicals_full])
x_test_chemicals_only <- as.matrix(chemicals_outcome_cov[test_indices, chemicals_full])
# ELASTIC NET
fit_without_covariates_train <- cv.glmnet(x_train_chemicals_only, y_train, alpha = 0.5, family = "gaussian")
fit_without_covariates_test <- predict(fit_without_covariates_train, s = "lambda.min", newx = x_test_chemicals_only)
test_mse_without_covariates <- mean((y_test - fit_without_covariates_test)^2)
test_rmse_without_covariates <- sqrt(test_mse_without_covariates)
plot(fit_without_covariates_train$glmnet.fit, xvar = "lambda", main = "Coefficient Path (Without Covariates)")
best_lambda <- fit_without_covariates_train$lambda.min # lambda that minimizes the MSE
coef(fit_without_covariates_train, s = best_lambda)
## 50 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -4.785950188
## hs_as_c_Log2 .
## hs_cd_c_Log2 -0.025843356
## hs_co_c_Log2 -0.005835867
## hs_cs_c_Log2 0.084715330
## hs_cu_c_Log2 0.607379616
## hs_hg_c_Log2 -0.009800093
## hs_mn_c_Log2 .
## hs_mo_c_Log2 -0.099724922
## hs_pb_c_Log2 -0.010318890
## hs_tl_cdich_None .
## hs_dde_cadj_Log2 -0.039528137
## hs_ddt_cadj_Log2 .
## hs_hcb_cadj_Log2 .
## hs_pcb118_cadj_Log2 .
## hs_pcb138_cadj_Log2 .
## hs_pcb153_cadj_Log2 -0.169008355
## hs_pcb170_cadj_Log2 -0.055808065
## hs_pcb180_cadj_Log2 .
## hs_dep_cadj_Log2 -0.019034348
## hs_detp_cadj_Log2 .
## hs_dmdtp_cdich_None .
## hs_dmp_cadj_Log2 .
## hs_dmtp_cadj_Log2 .
## hs_pbde153_cadj_Log2 -0.035464586
## hs_pbde47_cadj_Log2 .
## hs_pfhxs_c_Log2 -0.006816020
## hs_pfna_c_Log2 .
## hs_pfoa_c_Log2 -0.135997766
## hs_pfos_c_Log2 -0.047692264
## hs_pfunda_c_Log2 .
## hs_bpa_cadj_Log2 .
## hs_bupa_cadj_Log2 .
## hs_etpa_cadj_Log2 .
## hs_mepa_cadj_Log2 .
## hs_oxbe_cadj_Log2 0.002529961
## hs_prpa_cadj_Log2 0.001735800
## hs_trcs_cadj_Log2 .
## hs_mbzp_cadj_Log2 0.040317847
## hs_mecpp_cadj_Log2 .
## hs_mehhp_cadj_Log2 .
## hs_mehp_cadj_Log2 .
## hs_meohp_cadj_Log2 .
## hs_mep_cadj_Log2 .
## hs_mibp_cadj_Log2 -0.047892677
## hs_mnbp_cadj_Log2 -0.008483913
## hs_ohminp_cadj_Log2 .
## hs_oxominp_cadj_Log2 .
## hs_cotinine_cdich_None .
## hs_globalexp2_None .
cat("Model without Covariates - Test RMSE:", test_rmse_without_covariates)
## Model without Covariates - Test RMSE: 1.108515
#selected chemicals that were noted in enet
chemicals_selected <- c(
"hs_cd_c_Log2",
"hs_co_c_Log2",
"hs_cs_c_Log2",
"hs_cu_c_Log2",
"hs_hg_c_Log2",
"hs_mo_c_Log2",
"hs_pb_c_Log2",
"hs_dde_cadj_Log2",
"hs_pcb153_cadj_Log2",
"hs_pcb170_cadj_Log2",
"hs_dep_cadj_Log2",
"hs_detp_cadj_Log2",
"hs_pbde153_cadj_Log2",
"hs_pfhxs_c_Log2",
"hs_pfoa_c_Log2",
"hs_pfos_c_Log2",
"hs_mepa_cadj_Log2",
"hs_oxbe_cadj_Log2",
"hs_prpa_cadj_Log2",
"hs_mbzp_cadj_Log2",
"hs_mibp_cadj_Log2",
"hs_mnbp_cadj_Log2")
The chemical features above are those retained by elastic net. Elastic net was preferred over pure LASSO because, with correlated exposures and uncertain feature importance, LASSO tends to retain only one variable from each correlated group; the ridge component of elastic net lets correlated predictors enter together.
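Rather than hand-copying the nonzero terms from the printed coefficient matrix, the retained names can be pulled programmatically. A minimal base-R sketch (`select_nonzero` is an illustrative helper name, not part of the analysis code; with glmnet you would pass `as.matrix(coef(fit, s = "lambda.min"))[, 1]`):

```r
# Extract the names of nonzero coefficients from a fitted penalized model.
# Works on any named numeric vector of coefficients; with glmnet, obtain one
# via: coefs <- as.matrix(coef(fit, s = "lambda.min"))[, 1]
select_nonzero <- function(coefs) {
  kept <- names(coefs)[coefs != 0]
  setdiff(kept, "(Intercept)")   # drop the intercept, keep predictors only
}

# Toy example mimicking a sparse elastic net solution
coefs <- c("(Intercept)" = -4.79, hs_as_c_Log2 = 0,
           hs_cd_c_Log2 = -0.026, hs_cu_c_Log2 = 0.607)
select_nonzero(coefs)   # "hs_cd_c_Log2" "hs_cu_c_Log2"
```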
As with the chemical variables, the children's postnatal diet variables are screened with elastic net to identify the most predictive variables.
# train/test
set.seed(101)
train_indices <- sample(seq_len(nrow(diet_outcome_cov)), size = floor(0.7 * nrow(diet_outcome_cov)))
test_indices <- setdiff(seq_len(nrow(diet_outcome_cov)), train_indices)
diet_data <- diet_outcome_cov[, postnatal_diet]
x_diet_train <- model.matrix(~ . + 0, data = diet_data[train_indices, ])
x_diet_test <- model.matrix(~ . + 0, data = diet_data[test_indices, ])
covariates <- diet_outcome_cov[, c("e3_sex_None", "e3_yearbir_None", "hs_child_age_None")]
x_covariates_train <- model.matrix(~ . + 0, data = covariates[train_indices, ])
x_covariates_test <- model.matrix(~ . + 0, data = covariates[test_indices, ])
x_full_train <- cbind(x_diet_train, x_covariates_train)
x_full_test <- cbind(x_diet_test, x_covariates_test)
x_full_train[is.na(x_full_train)] <- 0
x_full_test[is.na(x_full_test)] <- 0
x_diet_train[is.na(x_diet_train)] <- 0
x_diet_test[is.na(x_diet_test)] <- 0
y_train <- as.numeric(diet_outcome_cov$hs_zbmi_who[train_indices])
y_test <- as.numeric(diet_outcome_cov$hs_zbmi_who[test_indices])
fit_without_covariates <- cv.glmnet(x_diet_train, y_train, alpha = 0.5, family = "gaussian")
fit_without_covariates
##
## Call: cv.glmnet(x = x_diet_train, y = y_train, alpha = 0.5, family = "gaussian")
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 0.1384 9 1.431 0.06031 5
## 1se 0.2914 1 1.442 0.06168 0
plot(fit_without_covariates$glmnet.fit, xvar = "lambda", main = "Coefficient Path (Without Covariates)")
best_lambda <- fit_without_covariates$lambda.min # lambda that minimizes the MSE
coef(fit_without_covariates, s = best_lambda)
## 41 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.52746936
## h_bfdur_Ter(0,10.8] .
## h_bfdur_Ter(10.8,34.9] .
## h_bfdur_Ter(34.9,Inf] .
## hs_bakery_prod_Ter(2,6] .
## hs_bakery_prod_Ter(6,Inf] .
## hs_beverages_Ter(0.132,1] .
## hs_beverages_Ter(1,Inf] .
## hs_break_cer_Ter(1.1,5.5] .
## hs_break_cer_Ter(5.5,Inf] .
## hs_caff_drink_Ter(0.132,Inf] .
## hs_dairy_Ter(14.6,25.6] .
## hs_dairy_Ter(25.6,Inf] .
## hs_fastfood_Ter(0.132,0.5] .
## hs_fastfood_Ter(0.5,Inf] .
## hs_org_food_Ter(0.132,1] .
## hs_org_food_Ter(1,Inf] -0.12911952
## hs_proc_meat_Ter(1.5,4] .
## hs_proc_meat_Ter(4,Inf] .
## hs_readymade_Ter(0.132,0.5] .
## hs_readymade_Ter(0.5,Inf] .
## hs_total_bread_Ter(7,17.5] .
## hs_total_bread_Ter(17.5,Inf] .
## hs_total_cereal_Ter(14.1,23.6] .
## hs_total_cereal_Ter(23.6,Inf] .
## hs_total_fish_Ter(1.5,3] .
## hs_total_fish_Ter(3,Inf] .
## hs_total_fruits_Ter(7,14.1] .
## hs_total_fruits_Ter(14.1,Inf] -0.02484832
## hs_total_lipids_Ter(3,7] .
## hs_total_lipids_Ter(7,Inf] -0.05017966
## hs_total_meat_Ter(6,9] .
## hs_total_meat_Ter(9,Inf] .
## hs_total_potatoes_Ter(3,4] .
## hs_total_potatoes_Ter(4,Inf] .
## hs_total_sweets_Ter(4.1,8.5] -0.01453272
## hs_total_sweets_Ter(8.5,Inf] .
## hs_total_veg_Ter(6,8.5] .
## hs_total_veg_Ter(8.5,Inf] -0.07850581
## hs_total_yog_Ter(6,8.5] .
## hs_total_yog_Ter(8.5,Inf] .
predictions_without_covariates <- predict(fit_without_covariates, s = "lambda.min", newx = x_diet_test)
mse_without_covariates <- mean((y_test - predictions_without_covariates)^2)
rmse_without_covariates <- sqrt(mse_without_covariates)
cat("Model without Covariates - Test RMSE:", rmse_without_covariates)
## Model without Covariates - Test RMSE: 1.161644
#selected diets that were noted in enet
diet_selected <- c(
"h_bfdur_Ter",
"hs_bakery_prod_Ter",
"hs_break_cer_Ter",
"hs_dairy_Ter",
"hs_fastfood_Ter",
"hs_org_food_Ter",
"hs_proc_meat_Ter",
"hs_total_fish_Ter",
"hs_total_fruits_Ter",
"hs_total_lipids_Ter",
"hs_total_sweets_Ter",
"hs_total_veg_Ter"
)
To analyze how dietary, chemical, and demographic variables relate to BMI Z-scores, several statistical models were employed, each chosen for its strengths in handling different aspects of the data. Models were trained on a 70% split of the data and evaluated on the remaining 30%. Ten-fold cross-validation was also applied to limit overfitting and to provide a reliable measure of each model's predictive power and generalizability.
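The split-plus-cross-validation workflow described above can be sketched in base R alone (`cv_rmse` is an illustrative helper, not part of the analysis code; `caret::trainControl` plays the same role in the models below):

```r
# Hand-rolled k-fold cross-validated RMSE for a linear model, base R only.
cv_rmse <- function(data, formula, k = 10, seed = 101) {
  set.seed(seed)
  # assign each row a random fold label 1..k
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  errs <- sapply(seq_len(k), function(i) {
    fit  <- lm(formula, data = data[folds != i, ])          # train on k-1 folds
    pred <- predict(fit, newdata = data[folds == i, ])      # predict held-out fold
    obs  <- model.response(model.frame(formula, data[folds == i, ]))
    (obs - pred)^2
  })
  sqrt(mean(unlist(errs)))   # pool squared errors, then take the root
}

# Usage on a built-in dataset:
cv_rmse(mtcars, mpg ~ wt + hp, k = 5)
```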
combined_data_selected <- c(
"h_bfdur_Ter",
"hs_bakery_prod_Ter",
"hs_break_cer_Ter",
"hs_dairy_Ter",
"hs_fastfood_Ter",
"hs_org_food_Ter",
"hs_proc_meat_Ter",
"hs_total_fish_Ter",
"hs_total_fruits_Ter",
"hs_total_lipids_Ter",
"hs_total_sweets_Ter",
"hs_total_veg_Ter",
"hs_cd_c_Log2",
"hs_co_c_Log2",
"hs_cs_c_Log2",
"hs_cu_c_Log2",
"hs_hg_c_Log2",
"hs_mo_c_Log2",
"hs_pb_c_Log2",
"hs_dde_cadj_Log2",
"hs_pcb153_cadj_Log2",
"hs_pcb170_cadj_Log2",
"hs_dep_cadj_Log2",
"hs_pbde153_cadj_Log2",
"hs_pfhxs_c_Log2",
"hs_pfoa_c_Log2",
"hs_pfos_c_Log2",
"hs_prpa_cadj_Log2",
"hs_mbzp_cadj_Log2",
"hs_mibp_cadj_Log2",
"hs_mnbp_cadj_Log2"
)
outcome_cov <- cbind(covariate_data, outcome_BMI)
outcome_cov <- outcome_cov[, !duplicated(colnames(outcome_cov))]
finalized_columns <- c(combined_data_selected)
final_selected_data <- exposome %>% dplyr::select(all_of(finalized_columns))
finalized_data <- cbind(outcome_cov, final_selected_data)
numeric_finalized <- finalized_data %>%
dplyr::select(where(is.numeric))
cor_matrix <- cor(numeric_finalized, use = "complete.obs")
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.srt = 90, tl.cex = 0.6)
find_highly_correlated <- function(cor_matrix, threshold = 0.8) {
cor_matrix[lower.tri(cor_matrix, diag = TRUE)] <- NA
cor_matrix <- as.data.frame(as.table(cor_matrix))
cor_matrix <- na.omit(cor_matrix)
cor_matrix <- cor_matrix[order(-abs(cor_matrix$Freq)), ]
cor_matrix <- cor_matrix %>% filter(abs(Freq) > threshold)
return(cor_matrix)
}
highly_correlated_pairs <- find_highly_correlated(cor_matrix, threshold = 0.50)
highly_correlated_pairs
The correlation plot for the selected variables indicates notable multicollinearity among several chemical variables and the child-age covariate. Grouped regression models such as LASSO, ridge, and elastic net handle these highly correlated variables collectively, which helps mitigate overfitting.
LASSO (Least Absolute Shrinkage and Selection Operator) applies L1 regularization, penalizing the sum of the absolute values of the coefficients. It performs variable selection and regularization simultaneously, identifying the most important predictors while setting less important ones to zero. Ridge regression is particularly useful for handling multicollinearity among predictors, ensuring that all variables contribute to the prediction without being overly penalized. Elastic net balances the benefits of both LASSO and ridge; by handling both variable selection and multicollinearity, it is well suited to high-dimensional datasets with correlated predictors. Because correlations increase once the diet, chemical, and serum metabolomic data are combined, group LASSO, ridge, and elastic net were applied.
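The ridge penalty's stabilizing effect can be seen directly from its closed-form solution, beta = (X'X + lambda I)^(-1) X'y. A base-R sketch (`ridge_beta` is an illustrative helper; glmnet additionally standardizes predictors and handles the intercept):

```r
# Closed-form ridge estimate: adding lambda to the diagonal of X'X
# shrinks coefficients and stabilizes the inverse when predictors are
# correlated (lambda = 0 recovers ordinary least squares).
ridge_beta <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

X <- cbind(x = c(1, 2, 3, 4))
y <- c(2, 4, 6, 8)                 # exact relation y = 2x
ridge_beta(X, y, lambda = 0)       # OLS: coefficient exactly 2
ridge_beta(X, y, lambda = 1)       # shrunk: 60/31, just under 2
```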
A decision tree splits the data into subsets based on input-feature values; without pruning it is prone to overfitting, so pruning was applied to improve generalizability while keeping an interpretable structure for understanding relationships between variables and the outcome. Random forest constructs many decision trees and averages their predictions, reducing overfitting and capturing complex interactions among variables. Gradient Boosting Machine (GBM) builds an ensemble of trees sequentially, each tree correcting the errors of its predecessors; by focusing on the residuals of previous trees, it is powerful for capturing non-linear relationships. These ensemble methods (random forest and GBM) were chosen for their robustness and high predictive accuracy: random forest reduces variance by averaging multiple trees, while GBM improves performance through iterative refinement.
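The sequential residual-fitting idea behind GBM can be illustrated with depth-1 "stumps" in base R. This is a toy sketch of the algorithm, not the `gbm` package implementation; `fit_stump` and `boost` are illustrative names:

```r
# Find the single split on x that minimizes the squared error of residuals r.
fit_stump <- function(x, r) {
  best <- list(sse = Inf)
  for (s in sort(unique(x))) {
    left <- r[x <= s]; right <- r[x > s]
    if (length(left) == 0 || length(right) == 0) next
    sse <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
    if (sse < best$sse) best <- list(sse = sse, split = s,
                                     left = mean(left), right = mean(right))
  }
  best
}

# Gradient boosting for squared-error loss: each stump is fit to the
# residuals of the current ensemble, then added with a small learning rate.
boost <- function(x, y, n_trees = 50, lr = 0.1) {
  pred <- rep(mean(y), length(y))    # start from the overall mean
  for (m in seq_len(n_trees)) {
    r  <- y - pred                   # residuals = negative gradient
    st <- fit_stump(x, r)
    pred <- pred + lr * ifelse(x <= st$split, st$left, st$right)
  }
  pred
}

x <- 1:20
y <- ifelse(x > 10, 5, 0)
pred <- boost(x, y)   # fitted values approach the step function
```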
In this study, the Root Mean Squared Error (RMSE) is used as the primary performance metric to evaluate and compare the predictive models. Using RMSE allows for straightforward comparisons between different models and datasets. Since RMSE is in the same units as the outcome variable, it facilitates direct assessment of how well different models perform in predicting BMI Z-scores, making it easier to identify the model that provides the most accurate predictions. By employing a diverse set of models, this analysis aims to identify the most significant predictors of BMI Z-scores and understand the complex interactions between dietary intake, chemical exposures, and metabolomic serum data. The combination of regularization techniques and ensemble methods ensures a comprehensive and reliable assessment of the data.
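The metric itself is a one-liner; defining it once avoids repeating the mean/sqrt idiom after each model (`rmse` is an illustrative helper name, not used in the analysis code below):

```r
# Root Mean Squared Error: same units as the outcome (BMI Z-score here),
# so a value of 1.0 means predictions miss by about one Z-score unit on average.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse(c(0, 1, 2), c(0, 1, 2))   # perfect predictions -> 0
rmse(c(0, 0, 0), c(1, 1, 1))   # constant unit error -> 1
```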
covariates_selected <- c("hs_child_age_None", "e3_sex_None", "e3_yearbir_None")
baseline_data <- finalized_data %>% dplyr::select(c(covariates_selected, "hs_zbmi_who"))
x <- model.matrix(~ . -1, data = baseline_data[ , !names(baseline_data) %in% "hs_zbmi_who"])
y <- baseline_data$hs_zbmi_who
train_control <- trainControl(method = "cv", number = 10)
set.seed(101)
trainIndex <- createDataPartition(y, p = .7, list = FALSE, times = 1)
x_train <- x[trainIndex, ]
y_train <- y[trainIndex]
x_test <- x[-trainIndex, ]
y_test <- y[-trainIndex]
penalty_factors <- rep(1, ncol(x_train))
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0
set.seed(101)
fit_lasso <- cv.glmnet(x_train, y_train, alpha = 1, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_lasso)
coef(fit_lasso)
## 10 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.87326347
## hs_child_age_None -0.05738585
## e3_sex_Nonefemale .
## e3_sex_Nonemale .
## e3_yearbir_None2004 .
## e3_yearbir_None2005 .
## e3_yearbir_None2006 .
## e3_yearbir_None2007 .
## e3_yearbir_None2008 .
## e3_yearbir_None2009 .
lasso_predictions <- predict(fit_lasso, s = "lambda.min", newx = x_test)
mse_lasso <- mean((y_test - lasso_predictions)^2)
rmse_lasso <- sqrt(mse_lasso)
cat("Baseline Lasso RMSE:", rmse_lasso, "\n")
## Baseline Lasso RMSE: 1.152017
lasso_model <- train(
x_train, y_train,
method = "glmnet",
tuneGrid = expand.grid(alpha = 1, lambda = seq(0.001, 1, length = 100)),
trControl = train_control,
penalty.factor = penalty_factors
)
lasso_predictions_cv <- predict(lasso_model, x_test)
mse_lasso_cv <- mean((y_test - lasso_predictions_cv)^2)
rmse_lasso_cv <- sqrt(mse_lasso_cv)
cat("10-Fold CV Lasso RMSE:", rmse_lasso_cv)
## 10-Fold CV Lasso RMSE: 1.152017
The LASSO model for covariates only shows which variables are retained and their coefficients, with the optimal lambda chosen near log(λ) ≈ -4, the point where the model best balances bias and variance and minimizes the mean squared error. The cross-validated RMSE is 1.152, indicating the model's predictive accuracy on unseen data. Among the covariates, only the child's age (hs_child_age_None) retains a nonzero coefficient (-0.057), suggesting that the BMI Z-score decreases slightly as age increases. The sex and birth-year indicators were shrunk to zero, so age is the most influential covariate in this baseline model.
set.seed(101)
fit_ridge <- cv.glmnet(x_train, y_train, alpha = 0, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_ridge)
coef(fit_ridge)
## 10 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 8.732635e-01
## hs_child_age_None -5.738585e-02
## e3_sex_Nonefemale -3.110211e-38
## e3_sex_Nonemale 3.110211e-38
## e3_yearbir_None2004 -1.463782e-37
## e3_yearbir_None2005 2.704907e-38
## e3_yearbir_None2006 -9.332751e-39
## e3_yearbir_None2007 -5.445405e-38
## e3_yearbir_None2008 3.129795e-38
## e3_yearbir_None2009 3.620840e-37
ridge_predictions <- predict(fit_ridge, s = "lambda.min", newx = x_test)
mse_ridge <- mean((y_test - ridge_predictions)^2)
rmse_ridge <- sqrt(mse_ridge)
cat("Baseline Ridge RMSE:", rmse_ridge, "\n")
## Baseline Ridge RMSE: 1.152017
ridge_model <- train(
x_train, y_train,
method = "glmnet",
tuneGrid = expand.grid(alpha = 0, lambda = seq(0.001, 1, length = 100)),
trControl = train_control,
penalty.factor = penalty_factors
)
ridge_predictions_cv <- predict(ridge_model, x_test)
mse_ridge_cv <- mean((y_test - ridge_predictions_cv)^2)
rmse_ridge_cv <- sqrt(mse_ridge_cv)
cat("10-Fold CV Ridge RMSE:", rmse_ridge_cv, "\n")
## 10-Fold CV Ridge RMSE: 1.149128
The ridge regression model for the covariates focuses on minimizing model complexity to avoid overfitting. The mean squared error (MSE) plot against log(lambda) values shows the model’s performance across various regularization levels. The optimal lambda is selected where MSE is minimized, balancing complexity and accuracy. The baseline ridge model yields an RMSE of 1.152, with a cross-validated RMSE of 1.149, indicating stable performance. The child’s age (hs_child_age_None) shows a negative association with BMI Z-score, suggesting older children tend to have slightly lower BMI Z-scores. Other covariates, like sex and year of birth, have near-zero coefficients, indicating minimal impact.
set.seed(101)
fit_enet <- cv.glmnet(x_train, y_train, alpha = 0.5, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_enet)
coef(fit_enet)
## 10 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.87326347
## hs_child_age_None -0.05738585
## e3_sex_Nonefemale .
## e3_sex_Nonemale .
## e3_yearbir_None2004 .
## e3_yearbir_None2005 .
## e3_yearbir_None2006 .
## e3_yearbir_None2007 .
## e3_yearbir_None2008 .
## e3_yearbir_None2009 .
enet_predictions <- predict(fit_enet, s = "lambda.min", newx = x_test)
mse_enet <- mean((y_test - enet_predictions)^2)
rmse_enet <- sqrt(mse_enet)
cat("Baseline Elastic Net RMSE:", rmse_enet, "\n")
## Baseline Elastic Net RMSE: 1.152017
enet_model <- train(
x_train, y_train,
method = "glmnet",
tuneGrid = expand.grid(alpha = seq(0, 1, length = 10), lambda = seq(0.001, 1, length = 100)),
trControl = train_control,
penalty.factor = penalty_factors
)
enet_predictions_cv <- predict(enet_model, x_test)
mse_enet_cv <- mean((y_test - enet_predictions_cv)^2)
rmse_enet_cv <- sqrt(mse_enet_cv)
cat("10-Fold CV Elastic Net RMSE:", rmse_enet_cv, "\n")
## 10-Fold CV Elastic Net RMSE: 1.152017
The elastic net model, which combines the LASSO and ridge penalties to perform both variable selection and coefficient shrinkage, achieved a baseline RMSE of 1.152 and maintained an RMSE of 1.152 after 10-fold cross-validation, indicating stable performance. The optimal lambda was selected near log(λ) ≈ -3. As with the LASSO, the child’s age (hs_child_age_None) was the only variable retained, with a coefficient of -0.0574 indicating a slight negative association with the BMI Z-score; sex and birth year were shrunk to zero.
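Because the caret grid above searches over both alpha and lambda, it can be useful to inspect which pair actually won. A minimal sketch on the built-in mtcars data (not the HELIX data), assuming caret and glmnet are installed:

```r
library(caret)
set.seed(101)
# small elastic net grid on mtcars, purely for illustration
m <- train(
  mpg ~ ., data = mtcars, method = "glmnet",
  tuneGrid = expand.grid(alpha = c(0, 0.5, 1), lambda = c(0.01, 0.1, 1)),
  trControl = trainControl(method = "cv", number = 5)
)
m$bestTune  # the (alpha, lambda) pair with the lowest cross-validated RMSE
```

An alpha of 0 in `bestTune` would indicate the CV preferred a pure ridge fit; an alpha of 1, a pure LASSO.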
set.seed(101)
trainIndex <- createDataPartition(baseline_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
train_data <- baseline_data[trainIndex, ]
test_data <- baseline_data[-trainIndex, ]
fit_tree <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")
rpart.plot(fit_tree)
printcp(fit_tree)
##
## Regression tree:
## rpart(formula = hs_zbmi_who ~ ., data = train_data, method = "anova")
##
## Variables actually used in tree construction:
## [1] hs_child_age_None
##
## Root node error: 1328.8/913 = 1.4554
##
## n= 913
##
## CP nsplit rel error xerror xstd
## 1 0.011581 0 1.00000 1.0026 0.051032
## 2 0.010000 1 0.98842 1.0130 0.051854
plotcp(fit_tree)
tree_predictions <- predict(fit_tree, newdata = test_data)
mse_tree <- mean((test_data$hs_zbmi_who - tree_predictions)^2)
rmse_tree <- sqrt(mse_tree)
cat("Baseline Unpruned Decision Tree RMSE:", rmse_tree, "\n")
## Baseline Unpruned Decision Tree RMSE: 1.155427
# getting the optimal cp value that minimizes the cross-validation error
optimal_cp <- fit_tree$cptable[which.min(fit_tree$cptable[,"xerror"]), "CP"]
cat("Optimal cp value:", optimal_cp, "\n")
## Optimal cp value: 0.01158106
pruned_tree <- prune(fit_tree, cp = optimal_cp)
rpart.plot(pruned_tree)
pruned_tree_predictions <- predict(pruned_tree, newdata = test_data)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)
cat("Baseline Pruned Decision Tree RMSE:", rmse_pruned_tree, "\n")
## Baseline Pruned Decision Tree RMSE: 1.148665
Pruning at the cp value that minimized rpart’s internal cross-validation error yielded only a modest improvement, since the tree contained a single split to begin with. Still, cross-validation identified the cp value that generalizes best to unseen data, and the model’s reliance on a single variable underscores the simplicity, yet potential power, of a decision tree for modeling BMI outcomes from age.
set.seed(101)
train_control <- trainControl(method = "cv", number = 10)
cp_grid <- expand.grid(cp = seq(0.001, 0.1, by = 0.001))
pruned_tree_model <- train(
hs_zbmi_who ~ .,
data = train_data,
method = "rpart",
trControl = train_control,
tuneGrid = cp_grid
)
best_cp <- pruned_tree_model$bestTune$cp
cat("Best cp value from cross-validation:", best_cp, "\n")
## Best cp value from cross-validation: 0.007
fit_tree_unpruned <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")
fit_tree_best <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova", control = rpart.control(cp = best_cp))
rpart.plot(fit_tree_unpruned)
printcp(fit_tree_unpruned)
##
## Regression tree:
## rpart(formula = hs_zbmi_who ~ ., data = train_data, method = "anova")
##
## Variables actually used in tree construction:
## [1] hs_child_age_None
##
## Root node error: 1328.8/913 = 1.4554
##
## n= 913
##
## CP nsplit rel error xerror xstd
## 1 0.011581 0 1.00000 1.0028 0.050936
## 2 0.010000 1 0.98842 1.0048 0.051009
plotcp(fit_tree_unpruned)
rpart.plot(fit_tree_best)
printcp(fit_tree_best)
##
## Regression tree:
## rpart(formula = hs_zbmi_who ~ ., data = train_data, method = "anova",
## control = rpart.control(cp = best_cp))
##
## Variables actually used in tree construction:
## [1] hs_child_age_None
##
## Root node error: 1328.8/913 = 1.4554
##
## n= 913
##
## CP nsplit rel error xerror xstd
## 1 0.011581 0 1.00000 1.0034 0.050969
## 2 0.007000 1 0.98842 1.0159 0.050969
plotcp(fit_tree_best)
unpruned_tree_predictions <- predict(fit_tree_unpruned, newdata = test_data)
pruned_tree_predictions <- predict(fit_tree_best, newdata = test_data)
mse_unpruned_tree <- mean((test_data$hs_zbmi_who - unpruned_tree_predictions)^2)
rmse_unpruned_tree <- sqrt(mse_unpruned_tree)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)
cat("Unpruned Decision Tree RMSE:", rmse_unpruned_tree, "\n")
## Unpruned Decision Tree RMSE: 1.155427
cat("Pruned Decision Tree RMSE with Best cp:", rmse_pruned_tree, "\n")
## Pruned Decision Tree RMSE with Best cp: 1.155427
The baseline decision tree, which used only “hs_child_age_None,” yielded a test-set RMSE of 1.155. Pruning with the complexity parameter (cp) of 0.0116 from rpart’s internal cross-validation reduced the RMSE to 1.149, indicating reduced overfitting.
The caret grid search instead selected a cp of 0.007, but the tree refit at this value produced the same test RMSE as the unpruned tree (1.155), so it did not improve on the earlier pruning.
The pruned decision tree using cross-validation emphasizes the significance of “hs_child_age_None” in predicting BMI Z-scores. It splits the data based on whether the child’s age is greater than or equal to 6.5 years. Children younger than 6.5 years have a higher BMI Z-score (0.64) compared to older children (0.34), suggesting age as a critical factor influencing BMI.
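The pruning above picks the cp at the minimum cross-validated error (`which.min(xerror)`). A common alternative is the one-standard-error rule, which prefers the smallest tree whose error is within one `xstd` of that minimum. A minimal sketch on the built-in mtcars data (not the HELIX data):

```r
library(rpart)
set.seed(101)
fit <- rpart(mpg ~ ., data = mtcars, method = "anova")
cp_tab <- fit$cptable
# row with the minimum cross-validated error
i_min  <- which.min(cp_tab[, "xerror"])
# one-SE threshold: min xerror plus its standard error
thresh <- cp_tab[i_min, "xerror"] + cp_tab[i_min, "xstd"]
# smallest tree (first row) whose xerror is under the threshold
cp_1se <- cp_tab[min(which(cp_tab[, "xerror"] <= thresh)), "CP"]
pruned_1se <- prune(fit, cp = cp_1se)
```

The one-SE tree is never larger than the minimum-error tree, trading a little accuracy for a simpler, more interpretable model.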
fit_rf <- randomForest(hs_zbmi_who ~ ., data = train_data, ntree = 500, importance = TRUE)
varImpPlot(fit_rf)
rf_predictions <- predict(fit_rf, newdata = test_data)
mse_rf <- mean((test_data$hs_zbmi_who - rf_predictions)^2)
rmse_rf <- sqrt(mse_rf)
cat("Baseline Random Forest RMSE:", rmse_rf, "\n")
## Baseline Random Forest RMSE: 1.157731
rf_model <- train(
x_train, y_train,
method = "rf",
trControl = train_control,
tuneLength = 10
)
## note: only 8 unique complexity parameters in default grid. Truncating the grid to 8 .
rf_predictions_cv <- predict(rf_model, x_test)
mse_rf_cv <- mean((y_test - rf_predictions_cv)^2)
rmse_rf_cv <- sqrt(mse_rf_cv)
cat("10-Fold CV Random Forest RMSE:", rmse_rf_cv, "\n")
## 10-Fold CV Random Forest RMSE: 1.154789
For the random forest model, the baseline RMSE was 1.158, improving slightly to 1.155 after 10-fold cross-validation. The importance plot ranks all three covariates: the child’s age (hs_child_age_None) was by far the most influential, followed by year of birth (e3_yearbir_None) and sex (e3_sex_None). Note that a random forest reports variable importance rather than statistical significance; age’s dominance indicates it drives most of the model’s predictive accuracy, with birth year and sex contributing to a lesser degree.
fit_gbm <- gbm(hs_zbmi_who ~ ., data = train_data, distribution = "gaussian", n.trees = 1000, interaction.depth = 3, shrinkage = 0.01, cv.folds = 5, verbose = FALSE)
summary(fit_gbm)
best_trees <- gbm.perf(fit_gbm, method = "cv")
gbm_predictions <- predict(fit_gbm, newdata = test_data, n.trees = best_trees)
mse_gbm <- mean((test_data$hs_zbmi_who - gbm_predictions)^2)
rmse_gbm <- sqrt(mse_gbm)
cat("Baseline GBM RMSE:", rmse_gbm, "\n")
## Baseline GBM RMSE: 1.14888
gbm_model <- train(
x_train, y_train,
method = "gbm",
trControl = train_control,
tuneLength = 10,
verbose = FALSE
)
gbm_predictions_cv <- predict(gbm_model, x_test)
mse_gbm_cv <- mean((y_test - gbm_predictions_cv)^2)
rmse_gbm_cv <- sqrt(mse_gbm_cv)
cat("10-Fold CV GBM RMSE:", rmse_gbm_cv, "\n")
## 10-Fold CV GBM RMSE: 1.149695
The GBM model produced a baseline RMSE of 1.1488 and a 10-fold cross-validated RMSE of 1.1497, showing good accuracy. Key variables include child age (hs_child_age_None), year of birth (e3_yearbir_None), and sex (e3_sex_None), with child age being the most influential. The relative influence of child age (79.6) far exceeds that of birth year (14.7) and sex (5.7), indicating its dominant role in predicting BMI Z-score. The model’s performance demonstrates that these covariates significantly impact BMI predictions in children.
Child age consistently emerged as the most influential predictor across all models, demonstrating its strong association with the BMI Z-score. While ridge and GBM performed slightly better, the choice between models may depend on the desired balance between interpretability (decision tree, LASSO) and predictive power (random forest, GBM). The ridge and elastic net models offer a middle ground, retaining some, albeit minimal, influence from the additional covariates.
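The covariates-only comparison can be made explicit by collecting the RMSEs reported above into one table (values copied from the outputs in this section; the pruned-tree entry is the test-set RMSE after cross-validated tuning):

```r
# RMSEs for the covariates-only models, as reported in the chunks above
results <- data.frame(
  model = c("LASSO", "Ridge", "Elastic net", "Pruned tree",
            "Random forest", "GBM"),
  rmse  = c(1.1520, 1.1491, 1.1520, 1.1487, 1.1548, 1.1497)
)
results[order(results$rmse), ]  # best-performing model first
```

Sorting makes clear that the differences between models are small (roughly 0.006 in RMSE), consistent with age being the only strong signal among the covariates.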
diet_selected <- c(
"h_bfdur_Ter",
"hs_bakery_prod_Ter",
"hs_break_cer_Ter",
"hs_dairy_Ter",
"hs_fastfood_Ter",
"hs_org_food_Ter",
"hs_proc_meat_Ter",
"hs_total_fish_Ter",
"hs_total_fruits_Ter",
"hs_total_lipids_Ter",
"hs_total_sweets_Ter",
"hs_total_veg_Ter"
)
covariates_selected <- c("hs_child_age_None", "e3_sex_None", "e3_yearbir_None")
diet_data <- finalized_data %>% dplyr::select(c(covariates_selected, diet_selected, "hs_zbmi_who"))
set.seed(101)
trainIndex <- createDataPartition(diet_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
train_data <- diet_data[trainIndex, ]
test_data <- diet_data[-trainIndex, ]
train_control <- trainControl(method = "cv", number = 10)
x_train <- model.matrix(hs_zbmi_who ~ . -1, train_data)
y_train <- train_data$hs_zbmi_who
x_test <- model.matrix(hs_zbmi_who ~ . -1, test_data)
y_test <- test_data$hs_zbmi_who
penalty_factors <- rep(1, ncol(x_train))
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0
fit_lasso <- cv.glmnet(x_train, y_train, alpha = 1, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_lasso)
coef(fit_lasso)
## 34 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.87326347
## hs_child_age_None -0.05738585
## e3_sex_Nonefemale .
## e3_sex_Nonemale .
## e3_yearbir_None2004 .
## e3_yearbir_None2005 .
## e3_yearbir_None2006 .
## e3_yearbir_None2007 .
## e3_yearbir_None2008 .
## e3_yearbir_None2009 .
## h_bfdur_Ter(10.8,34.9] .
## h_bfdur_Ter(34.9,Inf] .
## hs_bakery_prod_Ter(2,6] .
## hs_bakery_prod_Ter(6,Inf] .
## hs_break_cer_Ter(1.1,5.5] .
## hs_break_cer_Ter(5.5,Inf] .
## hs_dairy_Ter(14.6,25.6] .
## hs_dairy_Ter(25.6,Inf] .
## hs_fastfood_Ter(0.132,0.5] .
## hs_fastfood_Ter(0.5,Inf] .
## hs_org_food_Ter(0.132,1] .
## hs_org_food_Ter(1,Inf] .
## hs_proc_meat_Ter(1.5,4] .
## hs_proc_meat_Ter(4,Inf] .
## hs_total_fish_Ter(1.5,3] .
## hs_total_fish_Ter(3,Inf] .
## hs_total_fruits_Ter(7,14.1] .
## hs_total_fruits_Ter(14.1,Inf] .
## hs_total_lipids_Ter(3,7] .
## hs_total_lipids_Ter(7,Inf] .
## hs_total_sweets_Ter(4.1,8.5] .
## hs_total_sweets_Ter(8.5,Inf] .
## hs_total_veg_Ter(6,8.5] .
## hs_total_veg_Ter(8.5,Inf] .
lasso_predictions <- predict(fit_lasso, s = "lambda.min", newx = x_test)
mse_lasso <- mean((y_test - lasso_predictions)^2)
rmse_lasso <- sqrt(mse_lasso)
cat("Diet + Covariates Lasso RMSE:", rmse_lasso, "\n")
## Diet + Covariates Lasso RMSE: 1.138756
lasso_model <- train(
x_train, y_train,
method = "glmnet",
tuneGrid = expand.grid(alpha = 1, lambda = seq(0.001, 1, length = 100)),
trControl = train_control,
penalty.factor = penalty_factors
)
lasso_predictions_cv <- predict(lasso_model, x_test)
mse_lasso_cv <- mean((y_test - lasso_predictions_cv)^2)
rmse_lasso_cv <- sqrt(mse_lasso_cv)
cat("10-Fold CV Lasso RMSE:", rmse_lasso_cv)
## 10-Fold CV Lasso RMSE: 1.139468
The LASSO regression model for diet plus covariates yielded a baseline RMSE of 1.139 and a cross-validated RMSE of 1.139, indicating consistent performance. The optimal lambda value was near log(λ) ≈ -3, as shown in the plot. The child’s age was the only variable retained, with a coefficient of -0.0574, indicating that the BMI Z-score decreases as the child’s age increases. All dietary tertile indicators were shrunk to zero, implying limited additional predictive value for the BMI Z-score within this model.
fit_ridge <- cv.glmnet(x_train, y_train, alpha = 0, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_ridge)
coef(fit_ridge)
## 34 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 8.732635e-01
## hs_child_age_None -5.738585e-02
## e3_sex_Nonefemale -3.392957e-38
## e3_sex_Nonemale 3.392957e-38
## e3_yearbir_None2004 -1.596853e-37
## e3_yearbir_None2005 2.950807e-38
## e3_yearbir_None2006 -1.018118e-38
## e3_yearbir_None2007 -5.940442e-38
## e3_yearbir_None2008 3.414322e-38
## e3_yearbir_None2009 3.950007e-37
## h_bfdur_Ter(10.8,34.9] 1.959443e-37
## h_bfdur_Ter(34.9,Inf] -1.196150e-37
## hs_bakery_prod_Ter(2,6] 2.102760e-37
## hs_bakery_prod_Ter(6,Inf] -1.896677e-37
## hs_break_cer_Ter(1.1,5.5] 9.801144e-39
## hs_break_cer_Ter(5.5,Inf] -2.286856e-37
## hs_dairy_Ter(14.6,25.6] 5.634165e-38
## hs_dairy_Ter(25.6,Inf] 8.216707e-41
## hs_fastfood_Ter(0.132,0.5] 1.215262e-37
## hs_fastfood_Ter(0.5,Inf] -6.235230e-38
## hs_org_food_Ter(0.132,1] 8.146218e-38
## hs_org_food_Ter(1,Inf] -1.900266e-37
## hs_proc_meat_Ter(1.5,4] 1.193171e-37
## hs_proc_meat_Ter(4,Inf] -1.994408e-38
## hs_total_fish_Ter(1.5,3] -8.902720e-38
## hs_total_fish_Ter(3,Inf] 2.141752e-38
## hs_total_fruits_Ter(7,14.1] 2.168036e-37
## hs_total_fruits_Ter(14.1,Inf] -2.040600e-37
## hs_total_lipids_Ter(3,7] -9.170321e-39
## hs_total_lipids_Ter(7,Inf] -2.153479e-37
## hs_total_sweets_Ter(4.1,8.5] -7.677763e-38
## hs_total_sweets_Ter(8.5,Inf] -1.716694e-38
## hs_total_veg_Ter(6,8.5] 2.132182e-38
## hs_total_veg_Ter(8.5,Inf] -1.282604e-37
ridge_predictions <- predict(fit_ridge, s = "lambda.min", newx = x_test)
mse_ridge <- mean((y_test - ridge_predictions)^2)
rmse_ridge <- sqrt(mse_ridge)
cat("Diet + Covariates Ridge RMSE:", rmse_ridge, "\n")
## Diet + Covariates Ridge RMSE: 1.130793
ridge_model <- train(
x_train, y_train,
method = "glmnet",
tuneGrid = expand.grid(alpha = 0, lambda = seq(0.001, 1, length = 100)),
trControl = train_control,
penalty.factor = penalty_factors
)
ridge_predictions_cv <- predict(ridge_model, x_test)
mse_ridge_cv <- mean((y_test - ridge_predictions_cv)^2)
rmse_ridge_cv <- sqrt(mse_ridge_cv)
cat("10-Fold CV Ridge RMSE:", rmse_ridge_cv, "\n")
## 10-Fold CV Ridge RMSE: 1.129036
For the ridge regression model, the optimal lambda was identified at approximately log(λ) = 0. The baseline RMSE for this model is 1.131, and the 10-fold cross-validated RMSE is 1.129, indicating consistent model performance. Important variables in this model include hs_child_age_None with a coefficient of -0.057, suggesting that as the age of the child increases, the BMI Z-score tends to decrease. Other dietary variables and covariates, such as various food intake categories (e.g., bakery products, dairy, fast food), did not show significant coefficients, suggesting that they have less influence in this ridge regression model for predicting BMI Z-score.
fit_enet <- cv.glmnet(x_train, y_train, alpha = 0.5, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_enet)
coef(fit_enet)
## 34 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.87326347
## hs_child_age_None -0.05738585
## e3_sex_Nonefemale .
## e3_sex_Nonemale .
## e3_yearbir_None2004 .
## e3_yearbir_None2005 .
## e3_yearbir_None2006 .
## e3_yearbir_None2007 .
## e3_yearbir_None2008 .
## e3_yearbir_None2009 .
## h_bfdur_Ter(10.8,34.9] .
## h_bfdur_Ter(34.9,Inf] .
## hs_bakery_prod_Ter(2,6] .
## hs_bakery_prod_Ter(6,Inf] .
## hs_break_cer_Ter(1.1,5.5] .
## hs_break_cer_Ter(5.5,Inf] .
## hs_dairy_Ter(14.6,25.6] .
## hs_dairy_Ter(25.6,Inf] .
## hs_fastfood_Ter(0.132,0.5] .
## hs_fastfood_Ter(0.5,Inf] .
## hs_org_food_Ter(0.132,1] .
## hs_org_food_Ter(1,Inf] .
## hs_proc_meat_Ter(1.5,4] .
## hs_proc_meat_Ter(4,Inf] .
## hs_total_fish_Ter(1.5,3] .
## hs_total_fish_Ter(3,Inf] .
## hs_total_fruits_Ter(7,14.1] .
## hs_total_fruits_Ter(14.1,Inf] .
## hs_total_lipids_Ter(3,7] .
## hs_total_lipids_Ter(7,Inf] .
## hs_total_sweets_Ter(4.1,8.5] .
## hs_total_sweets_Ter(8.5,Inf] .
## hs_total_veg_Ter(6,8.5] .
## hs_total_veg_Ter(8.5,Inf] .
enet_predictions <- predict(fit_enet, s = "lambda.min", newx = x_test)
mse_enet <- mean((y_test - enet_predictions)^2)
rmse_enet <- sqrt(mse_enet)
cat("Diet + Covariates Elastic Net RMSE:", rmse_enet, "\n")
## Diet + Covariates Elastic Net RMSE: 1.136809
enet_model <- train(
x_train, y_train,
method = "glmnet",
tuneGrid = expand.grid(alpha = seq(0, 1, length = 10), lambda = seq(0.001, 1, length = 100)),
trControl = train_control,
penalty.factor = penalty_factors
)
enet_predictions_cv <- predict(enet_model, x_test)
mse_enet_cv <- mean((y_test - enet_predictions_cv)^2)
rmse_enet_cv <- sqrt(mse_enet_cv)
cat("10-Fold CV Elastic Net RMSE:", rmse_enet_cv, "\n")
## 10-Fold CV Elastic Net RMSE: 1.138767
The elastic net model’s optimal lambda value, near log(λ) ≈ -3, minimizes the mean squared error. Important variables identified include “hs_child_age_None” with a coefficient of -0.057, suggesting that older age is associated with a lower BMI Z-score, while other covariates like sex and year of birth showed no significant impact. The baseline RMSE of 1.137 and the cross-validated RMSE of 1.139 indicate the model’s performance and consistency.
fit_tree <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")
rpart.plot(fit_tree)
tree_predictions <- predict(fit_tree, newdata = test_data)
mse_tree <- mean((test_data$hs_zbmi_who - tree_predictions)^2)
rmse_tree <- sqrt(mse_tree)
cat("Diet + Covariates Decision Tree RMSE:", rmse_tree, "\n")
## Diet + Covariates Decision Tree RMSE: 1.152094
printcp(fit_tree)
##
## Regression tree:
## rpart(formula = hs_zbmi_who ~ ., data = train_data, method = "anova")
##
## Variables actually used in tree construction:
## [1] hs_bakery_prod_Ter hs_break_cer_Ter hs_child_age_None
## [4] hs_total_lipids_Ter
##
## Root node error: 1328.8/913 = 1.4554
##
## n= 913
##
## CP nsplit rel error xerror xstd
## 1 0.011773 0 1.00000 1.0010 0.050839
## 2 0.010484 3 0.96468 1.0155 0.052218
## 3 0.010000 4 0.95420 1.0156 0.052011
plotcp(fit_tree)
optimal_cp <- fit_tree$cptable[which.min(fit_tree$cptable[,"xerror"]), "CP"]
pruned_tree <- prune(fit_tree, cp = optimal_cp)
rpart.plot(pruned_tree)
pruned_tree_predictions <- predict(pruned_tree, newdata = test_data)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)
cat("Pruned Diet + Covariates Decision Tree RMSE:", rmse_pruned_tree, "\n")
## Pruned Diet + Covariates Decision Tree RMSE: 1.148665
set.seed(101)
train_control <- trainControl(method = "cv", number = 10)
cp_grid <- expand.grid(cp = seq(0.001, 0.1, by = 0.001))
pruned_tree_model <- train(
hs_zbmi_who ~ .,
data = train_data,
method = "rpart",
trControl = train_control,
tuneGrid = cp_grid
)
best_cp <- pruned_tree_model$bestTune$cp
cat("Best cp value from cross-validation:", best_cp, "\n")
## Best cp value from cross-validation: 0.1
fit_tree_unpruned <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")
fit_tree_best <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova", control = rpart.control(cp = best_cp))
rpart.plot(fit_tree_unpruned)
rpart.plot(fit_tree_best)
unpruned_tree_predictions <- predict(fit_tree_unpruned, newdata = test_data)
pruned_tree_predictions <- predict(fit_tree_best, newdata = test_data)
mse_unpruned_tree <- mean((test_data$hs_zbmi_who - unpruned_tree_predictions)^2)
rmse_unpruned_tree <- sqrt(mse_unpruned_tree)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)
cat("Unpruned Decision Tree RMSE:", rmse_unpruned_tree, "\n")
## Unpruned Decision Tree RMSE: 1.152094
cat("Pruned Decision Tree RMSE with Best cp:", rmse_pruned_tree, "\n")
## Pruned Decision Tree RMSE with Best cp: 1.148665
The decision tree model for diet and covariates has an unpruned RMSE of 1.152, which improves slightly to 1.149 after pruning. Pruning removes less informative branches, reducing overfitting and improving generalization. rpart’s internal cross-validation identified a cp of about 0.012 as optimal, while the caret grid search selected a much larger cp of 0.1, which collapses the tree yet yields the same test RMSE of 1.149. Key variables in the unpruned tree include the child’s age (hs_child_age_None) and dietary factors such as bakery products (hs_bakery_prod_Ter), breakfast cereals (hs_break_cer_Ter), and total lipids (hs_total_lipids_Ter), indicating a role for both age and specific dietary patterns in predicting BMI outcomes in children.
set.seed(101)
fit_rf <- randomForest(hs_zbmi_who ~ ., data = train_data)
varImpPlot(fit_rf)
importance(fit_rf)
## IncNodePurity
## hs_child_age_None 233.57368
## e3_sex_None 34.19375
## e3_yearbir_None 97.75201
## h_bfdur_Ter 65.45676
## hs_bakery_prod_Ter 64.03655
## hs_break_cer_Ter 65.99478
## hs_dairy_Ter 69.72797
## hs_fastfood_Ter 55.45883
## hs_org_food_Ter 65.89629
## hs_proc_meat_Ter 68.85280
## hs_total_fish_Ter 63.72114
## hs_total_fruits_Ter 64.85756
## hs_total_lipids_Ter 65.79002
## hs_total_sweets_Ter 66.74543
## hs_total_veg_Ter 69.89130
rf_predictions <- predict(fit_rf, newdata = test_data)
mse_rf <- mean((test_data$hs_zbmi_who - rf_predictions)^2)
rmse_rf <- sqrt(mse_rf)
cat("Diet + Covariates Random Forest RMSE:", rmse_rf, "\n")
## Diet + Covariates Random Forest RMSE: 1.139899
rf_model <- train(
x_train, y_train,
method = "rf",
trControl = train_control,
tuneLength = 10
)
rf_predictions_cv <- predict(rf_model, x_test)
mse_rf_cv <- mean((y_test - rf_predictions_cv)^2)
rmse_rf_cv <- sqrt(mse_rf_cv)
cat("10-Fold CV Random Forest RMSE:", rmse_rf_cv, "\n")
## 10-Fold CV Random Forest RMSE: 1.129478
The random forest model yielded an RMSE of 1.140, with a 10-fold cross-validated RMSE of 1.129, indicating robust model performance. The most influential variable in this model was the child’s age (hs_child_age_None), followed by the year of birth (e3_yearbir_None), and sex (e3_sex_None). Dietary variables also played significant roles, such as the total vegetable intake (hs_total_veg_Ter), processed meat intake (hs_proc_meat_Ter), bakery products (hs_bakery_prod_Ter), breakfast cereal intake (hs_break_cer_Ter), and fruit consumption (hs_total_fruits_Ter). These dietary factors, alongside the covariates, contributed to the prediction accuracy, highlighting the model’s ability to capture the multifactorial influences on the BMI Z-score in children.
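The `IncNodePurity` values printed above can be ranked programmatically rather than read off the plot. A minimal sketch on the built-in mtcars data (not the HELIX data), assuming randomForest is installed:

```r
library(randomForest)
set.seed(101)
fit <- randomForest(mpg ~ ., data = mtcars)
imp <- importance(fit)  # one IncNodePurity value per predictor
# sort predictors from most to least important
ranked <- imp[order(imp[, "IncNodePurity"], decreasing = TRUE), , drop = FALSE]
head(ranked, 3)
```

Applied to the `fit_rf` object above, the same `order()` call would place `hs_child_age_None` first, matching the `varImpPlot` ranking.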
set.seed(101)
fit_gbm <- gbm(hs_zbmi_who ~ ., data = train_data, distribution = "gaussian", n.trees = 1000, interaction.depth = 3, shrinkage = 0.01, cv.folds = 5, verbose = FALSE)
summary(fit_gbm)
best_trees <- gbm.perf(fit_gbm, method = "cv")
gbm_predictions <- predict(fit_gbm, newdata = test_data, n.trees = best_trees)
mse_gbm <- mean((test_data$hs_zbmi_who - gbm_predictions)^2)
rmse_gbm <- sqrt(mse_gbm)
cat("Diet + Covariates GBM RMSE:", rmse_gbm, "\n")
## Diet + Covariates GBM RMSE: 1.137923
gbm_model <- train(
x_train, y_train,
method = "gbm",
trControl = train_control,
tuneLength = 10,
verbose = FALSE
)
gbm_predictions_cv <- predict(gbm_model, x_test)
mse_gbm_cv <- mean((y_test - gbm_predictions_cv)^2)
rmse_gbm_cv <- sqrt(mse_gbm_cv)
cat("10-Fold CV GBM RMSE:", rmse_gbm_cv, "\n")
## 10-Fold CV GBM RMSE: 1.136542
The GBM model yielded an RMSE of 1.138, with a cross-validated RMSE of 1.137, demonstrating strong predictive accuracy. The variable “hs_child_age_None” had the highest relative influence (27.7), followed by “e3_yearbir_None” (12.5) and “hs_break_cer_Ter” (6.3). Diet-related variables such as “hs_total_lipids_Ter” (5.8), “hs_total_veg_Ter” (5.8), and “hs_bakery_prod_Ter” (5.4) also contributed, highlighting the interplay between dietary habits and covariates in influencing childhood BMI outcomes.
Overall, “hs_child_age_None” consistently emerged as the most influential variable across all models, highlighting its critical role in predicting BMI Z-score. Dietary variables such as “hs_bakery_prod_Ter,” “hs_break_cer_Ter,” and “hs_total_lipids_Ter” were also significant in multiple models, underscoring the interplay between diet and child age in influencing BMI outcomes. The GBM model, with the lowest RMSE, suggests the highest predictive accuracy among the methods evaluated.
chemicals_selected <- c(
"hs_cd_c_Log2", "hs_co_c_Log2", "hs_cs_c_Log2", "hs_cu_c_Log2",
"hs_hg_c_Log2", "hs_mo_c_Log2", "hs_pb_c_Log2", "hs_dde_cadj_Log2",
"hs_pcb153_cadj_Log2", "hs_pcb170_cadj_Log2", "hs_dep_cadj_Log2",
"hs_pbde153_cadj_Log2", "hs_pfhxs_c_Log2", "hs_pfoa_c_Log2",
"hs_pfos_c_Log2", "hs_prpa_cadj_Log2", "hs_mbzp_cadj_Log2", "hs_mibp_cadj_Log2",
"hs_mnbp_cadj_Log2"
)
chemical_data <- finalized_data %>% dplyr::select(c(covariates_selected, chemicals_selected, "hs_zbmi_who"))
set.seed(101)
trainIndex <- createDataPartition(chemical_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
train_data <- chemical_data[trainIndex, ]
test_data <- chemical_data[-trainIndex, ]
train_control <- trainControl(method = "cv", number = 10)
x_train <- model.matrix(hs_zbmi_who ~ . -1, train_data)
y_train <- train_data$hs_zbmi_who
x_test <- model.matrix(hs_zbmi_who ~ . -1, test_data)
y_test <- test_data$hs_zbmi_who
penalty_factors <- rep(1, ncol(x_train))
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0
fit_lasso <- cv.glmnet(x_train, y_train, alpha = 1, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_lasso)
coef(fit_lasso)
## 29 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -1.201642945
## hs_child_age_None -0.042729967
## e3_sex_Nonefemale .
## e3_sex_Nonemale .
## e3_yearbir_None2004 -0.040205226
## e3_yearbir_None2005 .
## e3_yearbir_None2006 .
## e3_yearbir_None2007 .
## e3_yearbir_None2008 .
## e3_yearbir_None2009 .
## hs_cd_c_Log2 -0.011319810
## hs_co_c_Log2 .
## hs_cs_c_Log2 .
## hs_cu_c_Log2 0.260654501
## hs_hg_c_Log2 .
## hs_mo_c_Log2 -0.069199903
## hs_pb_c_Log2 .
## hs_dde_cadj_Log2 -0.041339092
## hs_pcb153_cadj_Log2 -0.125818759
## hs_pcb170_cadj_Log2 -0.043143964
## hs_dep_cadj_Log2 -0.003677983
## hs_pbde153_cadj_Log2 -0.036720000
## hs_pfhxs_c_Log2 .
## hs_pfoa_c_Log2 -0.097979410
## hs_pfos_c_Log2 .
## hs_prpa_cadj_Log2 .
## hs_mbzp_cadj_Log2 0.007077421
## hs_mibp_cadj_Log2 -0.025837249
## hs_mnbp_cadj_Log2 -0.004654193
lasso_predictions <- predict(fit_lasso, s = "lambda.min", newx = x_test)
mse_lasso <- mean((y_test - lasso_predictions)^2)
rmse_lasso <- sqrt(mse_lasso)
cat("Chemical + Covariates Lasso RMSE:", rmse_lasso, "\n")
## Chemical + Covariates Lasso RMSE: 1.040557
lasso_model <- train(
x_train, y_train,
method = "glmnet",
tuneGrid = expand.grid(alpha = 1, lambda = seq(0.001, 1, length = 100)),
trControl = train_control,
penalty.factor = penalty_factors
)
lasso_predictions_cv <- predict(lasso_model, x_test)
mse_lasso_cv <- mean((y_test - lasso_predictions_cv)^2)
rmse_lasso_cv <- sqrt(mse_lasso_cv)
cat("10-Fold CV Lasso RMSE:", rmse_lasso_cv)
## 10-Fold CV Lasso RMSE: 1.040727
The LASSO regression model for the chemical and covariate data includes variables such as age, sex, year of birth, and various chemical concentrations. The optimal lambda (log(λ) ≈ -3) minimizes the cross-validated error, balancing bias and variance. The model achieves a cross-validated RMSE of 1.041, indicating reasonable prediction accuracy. Influential retained variables include hs_child_age_None (-0.0427), hs_cu_c_Log2 (0.2607), and hs_pfoa_c_Log2 (-0.0980). Negative coefficients indicate an inverse relationship with the BMI Z-score; positive coefficients indicate a direct relationship.
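Rather than scanning the sparse matrix printout for non-dot entries, the selected terms can be extracted into a small table. A minimal sketch on simulated data (not the HELIX data):

```r
library(glmnet)
set.seed(101)
# simulated stand-in data: two truly informative predictors out of ten
x <- matrix(rnorm(200 * 10), 200, 10)
y <- 0.8 * x[, 1] - 0.5 * x[, 2] + rnorm(200)
fit <- cv.glmnet(x, y, alpha = 1)          # LASSO, as in the chunk above
b <- coef(fit, s = "lambda.min")           # sparse coefficient matrix
nonzero <- b[, 1] != 0
data.frame(term = rownames(b)[nonzero],    # retained terms only
           coef = b[nonzero, 1])
```

The same pattern applied to `fit_lasso` above would reproduce the list of selected chemicals and their coefficients without the zeroed-out rows.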
fit_ridge <- cv.glmnet(x_train, y_train, alpha = 0, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_ridge)
coef(fit_ridge)
## 29 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -1.609471669
## hs_child_age_None -0.054184972
## e3_sex_Nonefemale -0.023174163
## e3_sex_Nonemale 0.023148148
## e3_yearbir_None2004 -0.128309522
## e3_yearbir_None2005 0.028409105
## e3_yearbir_None2006 0.053001523
## e3_yearbir_None2007 -0.027948121
## e3_yearbir_None2008 -0.015135402
## e3_yearbir_None2009 0.235538660
## hs_cd_c_Log2 -0.036089222
## hs_co_c_Log2 -0.030180059
## hs_cs_c_Log2 0.066973465
## hs_cu_c_Log2 0.319809539
## hs_hg_c_Log2 -0.014578657
## hs_mo_c_Log2 -0.071569755
## hs_pb_c_Log2 -0.039355400
## hs_dde_cadj_Log2 -0.048885680
## hs_pcb153_cadj_Log2 -0.115503946
## hs_pcb170_cadj_Log2 -0.036730018
## hs_dep_cadj_Log2 -0.011840831
## hs_pbde153_cadj_Log2 -0.027003858
## hs_pfhxs_c_Log2 -0.026465601
## hs_pfoa_c_Log2 -0.101071688
## hs_pfos_c_Log2 -0.025692420
## hs_prpa_cadj_Log2 0.001538225
## hs_mbzp_cadj_Log2 0.039172212
## hs_mibp_cadj_Log2 -0.035550506
## hs_mnbp_cadj_Log2 -0.039493899
ridge_predictions <- predict(fit_ridge, s = "lambda.min", newx = x_test)
mse_ridge <- mean((y_test - ridge_predictions)^2)
rmse_ridge <- sqrt(mse_ridge)
cat("Chemical + Covariates Ridge RMSE:", rmse_ridge, "\n")
## Chemical + Covariates Ridge RMSE: 1.035505
ridge_model <- train(
x_train, y_train,
method = "glmnet",
tuneGrid = expand.grid(alpha = 0, lambda = seq(0.001, 1, length = 100)),
trControl = train_control,
penalty.factor = penalty_factors
)
ridge_predictions_cv <- predict(ridge_model, x_test)
mse_ridge_cv <- mean((y_test - ridge_predictions_cv)^2)
rmse_ridge_cv <- sqrt(mse_ridge_cv)
cat("10-Fold CV Ridge RMSE:", rmse_ridge_cv, "\n")
## 10-Fold CV Ridge RMSE: 1.035411
The ridge regression model for the combination of chemicals and covariates yields an RMSE of 1.036 (cross-validated 1.035), demonstrating its predictive accuracy. The optimal lambda is chosen near log(λ) ≈ -2, balancing the bias-variance tradeoff. Influential variables include the child’s age, sex, and birth year, as well as several chemical concentrations. For instance, higher copper levels (hs_cu_c_Log2, coefficient 0.32) show a notable positive association with the BMI Z-score, whereas higher concentrations of cadmium (hs_cd_c_Log2), lead (hs_pb_c_Log2), and PCB-153 (hs_pcb153_cadj_Log2) are negatively associated.
fit_enet <- cv.glmnet(x_train, y_train, alpha = 0.5, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_enet)
coef(fit_enet)
## 29 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -1.30722494
## hs_child_age_None -0.04475073
## e3_sex_Nonefemale .
## e3_sex_Nonemale .
## e3_yearbir_None2004 -0.06251662
## e3_yearbir_None2005 .
## e3_yearbir_None2006 .
## e3_yearbir_None2007 .
## e3_yearbir_None2008 .
## e3_yearbir_None2009 .
## hs_cd_c_Log2 -0.01538343
## hs_co_c_Log2 .
## hs_cs_c_Log2 .
## hs_cu_c_Log2 0.27721103
## hs_hg_c_Log2 .
## hs_mo_c_Log2 -0.07149465
## hs_pb_c_Log2 .
## hs_dde_cadj_Log2 -0.04498749
## hs_pcb153_cadj_Log2 -0.12563741
## hs_pcb170_cadj_Log2 -0.04301914
## hs_dep_cadj_Log2 -0.00523836
## hs_pbde153_cadj_Log2 -0.03600515
## hs_pfhxs_c_Log2 .
## hs_pfoa_c_Log2 -0.10302534
## hs_pfos_c_Log2 .
## hs_prpa_cadj_Log2 .
## hs_mbzp_cadj_Log2 0.01485361
## hs_mibp_cadj_Log2 -0.02916026
## hs_mnbp_cadj_Log2 -0.01195464
enet_predictions <- predict(fit_enet, s = "lambda.min", newx = x_test)
mse_enet <- mean((y_test - enet_predictions)^2)
rmse_enet <- sqrt(mse_enet)
cat("Chemical + Covariates Elastic Net RMSE:", rmse_enet, "\n")
## Chemical + Covariates Elastic Net RMSE: 1.040324
enet_model <- train(
x_train, y_train,
method = "glmnet",
tuneGrid = expand.grid(alpha = seq(0, 1, length = 10), lambda = seq(0.001, 1, length = 100)),
trControl = train_control,
penalty.factor = penalty_factors
)
enet_predictions_cv <- predict(enet_model, x_test)
mse_enet_cv <- mean((y_test - enet_predictions_cv)^2)
rmse_enet_cv <- sqrt(mse_enet_cv)
cat("10-Fold CV Elastic Net RMSE:", rmse_enet_cv, "\n")
## 10-Fold CV Elastic Net RMSE: 1.035456
The elastic net model’s optimal lambda, chosen around log(λ) ≈ -2.3, minimizes the cross-validated mean-squared error. Influential variables included child age, whose negative coefficient indicates that older age is associated with a lower BMI Z-score. Higher log-transformed copper levels showed a strong positive association with BMI Z-score, while cadmium, DDE, PFOA, and PCBs (polychlorinated biphenyls) displayed negative associations; the lead coefficient was shrunk to zero. The model’s test RMSE was 1.040 and the 10-fold cross-validated RMSE was 1.035, comparable to ridge.
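For reference, the quantity being tuned here is glmnet's elastic net penalty, which mixes the LASSO (L1) and ridge (L2) terms through the mixing parameter alpha. The sketch below is a minimal, self-contained illustration with made-up coefficients, not values from the fitted model:

```r
# glmnet's elastic net penalty: lambda * [alpha*||beta||_1 + (1-alpha)*||beta||_2^2 / 2]
# alpha = 1 recovers the LASSO penalty, alpha = 0 the ridge penalty
enet_penalty <- function(beta, lambda, alpha) {
  lambda * (alpha * sum(abs(beta)) + (1 - alpha) * sum(beta^2) / 2)
}
enet_penalty(c(0.5, -0.2, 0), lambda = 0.1, alpha = 0.5)  # 0.04225
```

At alpha = 0.5, as used in the cv.glmnet call above, the two penalty terms contribute equally; the caret grid additionally searches over alpha.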
set.seed(101)
fit_tree <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")
rpart.plot(fit_tree)
tree_predictions <- predict(fit_tree, newdata = test_data)
mse_tree <- mean((test_data$hs_zbmi_who - tree_predictions)^2)
rmse_tree <- sqrt(mse_tree)
cat("Chemical + Covariates Decision Tree RMSE:", rmse_tree, "\n")
## Chemical + Covariates Decision Tree RMSE: 1.108581
printcp(fit_tree)
##
## Regression tree:
## rpart(formula = hs_zbmi_who ~ ., data = train_data, method = "anova")
##
## Variables actually used in tree construction:
## [1] hs_cu_c_Log2 hs_dde_cadj_Log2 hs_mbzp_cadj_Log2
## [4] hs_mnbp_cadj_Log2 hs_pbde153_cadj_Log2 hs_pcb170_cadj_Log2
## [7] hs_pfoa_c_Log2
##
## Root node error: 1328.8/913 = 1.4554
##
## n= 913
##
## CP nsplit rel error xerror xstd
## 1 0.077388 0 1.00000 1.00089 0.050861
## 2 0.032769 1 0.92261 0.93652 0.048377
## 3 0.032579 2 0.88984 0.91682 0.046750
## 4 0.025008 3 0.85726 0.91218 0.046570
## 5 0.014062 4 0.83226 0.89795 0.045241
## 6 0.013674 6 0.80413 0.89837 0.046138
## 7 0.013602 7 0.79046 0.89907 0.046094
## 8 0.013517 8 0.77686 0.89907 0.046094
## 9 0.011314 9 0.76334 0.90679 0.046317
## 10 0.011047 11 0.74071 0.91227 0.047148
## 11 0.010329 12 0.72967 0.92524 0.047400
## 12 0.010000 14 0.70901 0.92466 0.047181
plotcp(fit_tree)
optimal_cp <- fit_tree$cptable[which.min(fit_tree$cptable[,"xerror"]), "CP"]
pruned_tree <- prune(fit_tree, cp = optimal_cp)
rpart.plot(pruned_tree)
pruned_tree_predictions <- predict(pruned_tree, newdata = test_data)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)
cat("Pruned Chemical + Covariates Decision Tree RMSE:", rmse_pruned_tree, "\n")
## Pruned Chemical + Covariates Decision Tree RMSE: 1.089746
set.seed(101)
train_control <- trainControl(method = "cv", number = 10)
cp_grid <- expand.grid(cp = seq(0.001, 0.1, by = 0.001))
pruned_tree_model <- train(
hs_zbmi_who ~ .,
data = train_data,
method = "rpart",
trControl = train_control,
tuneGrid = cp_grid
)
best_cp <- pruned_tree_model$bestTune$cp
cat("Best cp value from cross-validation:", best_cp, "\n")
## Best cp value from cross-validation: 0.023
fit_tree_unpruned <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")
fit_tree_best <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova", control = rpart.control(cp = best_cp))
rpart.plot(fit_tree_unpruned)
rpart.plot(fit_tree_best)
unpruned_tree_predictions <- predict(fit_tree_unpruned, newdata = test_data)
pruned_tree_predictions <- predict(fit_tree_best, newdata = test_data)
mse_unpruned_tree <- mean((test_data$hs_zbmi_who - unpruned_tree_predictions)^2)
rmse_unpruned_tree <- sqrt(mse_unpruned_tree)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)
cat("Unpruned Decision Tree RMSE:", rmse_unpruned_tree, "\n")
## Unpruned Decision Tree RMSE: 1.108581
cat("Pruned Decision Tree RMSE with Best cp:", rmse_pruned_tree, "\n")
## Pruned Decision Tree RMSE with Best cp: 1.089746
The decision tree model for predicting BMI Z-scores from chemical and covariate data yielded a test RMSE of 1.109. The variables actually used in tree construction were log-transformed concentrations of copper, DDE, MBzP, MnBP, PBDE-153, PCB-170, and PFOA. Pruning improved the test RMSE to 1.090 by removing splits that did not reduce cross-validated error, and the optimal complexity parameter (cp) from cross-validation was 0.023. The pruned model therefore gives a more generalized fit with less overfitting, as reflected in its lower test RMSE. The tree structure shows the hierarchical influence of these chemicals on BMI Z-scores, with PCB-170 and PBDE-153 forming key decision nodes in both the pruned and unpruned trees.
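The pruning rule applied above simply selects the cp row with the smallest cross-validated error (xerror). A self-contained illustration, using the first five rows copied from the printcp() output above:

```r
# first five rows of the cp table reported by printcp() above
cptable <- data.frame(
  CP     = c(0.077388, 0.032769, 0.032579, 0.025008, 0.014062),
  xerror = c(1.00089, 0.93652, 0.91682, 0.91218, 0.89795)
)
# pick the complexity parameter whose cross-validated error is lowest
optimal_cp <- cptable$CP[which.min(cptable$xerror)]
optimal_cp  # 0.014062
```

Note that caret's grid search over cp uses its own resampling, which is why its best cp (0.023) differs from the which.min(xerror) choice on rpart's internal table.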
fit_rf <- randomForest(hs_zbmi_who ~ ., data = train_data, ntree = 500)
rf_predictions <- predict(fit_rf, newdata = test_data)
varImpPlot(fit_rf)
importance(fit_rf)
## IncNodePurity
## hs_child_age_None 48.822971
## e3_sex_None 6.356073
## e3_yearbir_None 34.327078
## hs_cd_c_Log2 49.069793
## hs_co_c_Log2 51.682060
## hs_cs_c_Log2 42.322759
## hs_cu_c_Log2 66.200851
## hs_hg_c_Log2 53.722138
## hs_mo_c_Log2 59.137163
## hs_pb_c_Log2 46.948933
## hs_dde_cadj_Log2 81.399724
## hs_pcb153_cadj_Log2 86.076191
## hs_pcb170_cadj_Log2 132.902458
## hs_dep_cadj_Log2 55.402055
## hs_pbde153_cadj_Log2 100.134593
## hs_pfhxs_c_Log2 44.500787
## hs_pfoa_c_Log2 60.306554
## hs_pfos_c_Log2 51.809387
## hs_prpa_cadj_Log2 50.765210
## hs_mbzp_cadj_Log2 57.448789
## hs_mibp_cadj_Log2 45.657870
## hs_mnbp_cadj_Log2 53.036576
mse_rf <- mean((y_test - rf_predictions)^2)
rmse_rf <- sqrt(mse_rf)
cat("Chemical + Covariates Random Forest RMSE:", rmse_rf, "\n")
## Chemical + Covariates Random Forest RMSE: 1.031425
rf_model <- train(
x_train, y_train,
method = "rf",
trControl = train_control,
tuneLength = 10
)
rf_predictions_cv <- predict(rf_model, x_test)
mse_rf_cv <- mean((y_test - rf_predictions_cv)^2)
rmse_rf_cv <- sqrt(mse_rf_cv)
cat("10-Fold CV Random Forest RMSE:", rmse_rf_cv, "\n")
## 10-Fold CV Random Forest RMSE: 1.031669
The random forest model shows robust performance, with a test RMSE of 1.031 and a cross-validated RMSE of 1.032. The most important variables are PCB-170, PBDE-153, and PCB-153, followed by DDE, copper, and PFOA. These RMSE values indicate that the model predicts BMI Z-scores effectively from chemical exposures and covariates. The near-identical hold-out and cross-validated RMSEs suggest the model is stable rather than overfit to a particular data split, and the cross-validation further confirms its predictive accuracy.
fit_gbm <- gbm(hs_zbmi_who ~ ., data = train_data, distribution = "gaussian", n.trees = 100, interaction.depth = 3, shrinkage = 0.01, cv.folds = 5)
summary(fit_gbm)
best_iter <- gbm.perf(fit_gbm, method = "cv")
gbm_predictions <- predict(fit_gbm, newdata = test_data, n.trees = best_iter)
mse_gbm <- mean((y_test - gbm_predictions)^2)
rmse_gbm <- sqrt(mse_gbm)
cat("Chemical + Covariates GBM RMSE:", rmse_gbm, "\n")
## Chemical + Covariates GBM RMSE: 1.076264
gbm_model <- train(
x_train, y_train,
method = "gbm",
trControl = train_control,
tuneLength = 10,
verbose = FALSE
)
gbm_predictions_cv <- predict(gbm_model, x_test)
mse_gbm_cv <- mean((y_test - gbm_predictions_cv)^2)
rmse_gbm_cv <- sqrt(mse_gbm_cv)
cat("10-Fold CV GBM RMSE:", rmse_gbm_cv, "\n")
## 10-Fold CV GBM RMSE: 1.040439
The GBM model’s relative influence plot shows that polychlorinated biphenyl-170 (hs_pcb170_cadj_Log2), polybrominated diphenyl ether-153 (hs_pbde153_cadj_Log2), and dichlorodiphenyldichloroethylene (hs_dde_cadj_Log2) have the highest relative influence on the model’s predictions. The baseline GBM has a test RMSE of 1.076, while the tuned, cross-validated model achieves 1.040, indicating that hyperparameter tuning improved generalization to unseen data.
Overall, the models for the chemical and covariate data highlight the significant role of specific chemicals in predicting outcomes. The random forest achieved the lowest cross-validated RMSE (1.032), with ridge regression and elastic net close behind (both about 1.035) and the GBM slightly higher (1.040). Decision tree models benefited from pruning, which improved their performance, but they still lagged behind the other methods. The LASSO, ridge, and elastic net models provided directly interpretable coefficients; among them, ridge and elastic net yielded lower RMSEs than LASSO, suggesting better generalization.
In summary, the models consistently identified polychlorinated biphenyl-170, polybrominated diphenyl ether-153, dichlorodiphenyldichloroethylene, copper, and perfluorooctanoic acid as crucial variables for predicting BMI Z-scores.
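For a side-by-side comparison, the RMSEs reported in this section can be tabulated directly; the values below are copied from the output above (10-fold CV figures, plus the pruned tree's test-set RMSE):

```r
# RMSEs for the chemical + covariates models, as reported above
rmse_summary <- data.frame(
  Model = c("Ridge (10-fold CV)", "Elastic Net (10-fold CV)",
            "Pruned Decision Tree (test)", "Random Forest (10-fold CV)",
            "GBM (10-fold CV)"),
  RMSE  = c(1.035411, 1.035456, 1.089746, 1.031669, 1.040439)
)
rmse_summary[order(rmse_summary$RMSE), ]  # random forest ranks first
```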
covariates_selected <- c("hs_child_age_None", "e3_sex_None", "e3_yearbir_None")
diet_selected <- c(
"h_bfdur_Ter", "hs_bakery_prod_Ter", "hs_break_cer_Ter", "hs_dairy_Ter",
"hs_fastfood_Ter", "hs_org_food_Ter", "hs_proc_meat_Ter", "hs_total_fish_Ter",
"hs_total_fruits_Ter", "hs_total_lipids_Ter", "hs_total_sweets_Ter", "hs_total_veg_Ter"
)
chemicals_selected <- c(
"hs_cd_c_Log2", "hs_co_c_Log2", "hs_cs_c_Log2", "hs_cu_c_Log2",
"hs_hg_c_Log2", "hs_mo_c_Log2", "hs_pb_c_Log2", "hs_dde_cadj_Log2",
"hs_pcb153_cadj_Log2", "hs_pcb170_cadj_Log2", "hs_dep_cadj_Log2",
"hs_pbde153_cadj_Log2", "hs_pfhxs_c_Log2", "hs_pfoa_c_Log2",
"hs_pfos_c_Log2", "hs_prpa_cadj_Log2", "hs_mbzp_cadj_Log2", "hs_mibp_cadj_Log2",
"hs_mnbp_cadj_Log2"
)
combined_selected <- c(covariates_selected, diet_selected, chemicals_selected)
chemical_diet_cov_data <- finalized_data %>% dplyr::select(all_of(c(combined_selected, "hs_zbmi_who")))
set.seed(101)
trainIndex <- createDataPartition(chemical_diet_cov_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
train_data <- chemical_diet_cov_data[trainIndex,]
test_data <- chemical_diet_cov_data[-trainIndex,]
train_control <- trainControl(method = "cv", number = 10)
x_train <- model.matrix(hs_zbmi_who ~ . -1, train_data)
y_train <- train_data$hs_zbmi_who
x_test <- model.matrix(hs_zbmi_who ~ . -1, test_data)
y_test <- test_data$hs_zbmi_who
penalty_factors <- rep(1, ncol(x_train))
# leave covariates unpenalized; note this matches only columns whose names equal
# the covariate names exactly (the numeric hs_child_age_None), while factor
# dummies such as e3_sex_Nonefemale keep a penalty factor of 1
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0
num_covariates <- length(covariates_selected)
num_diet <- length(diet_selected)
num_chemicals <- length(chemicals_selected)
# make the group_indices vector
group_indices <- c(
rep(1, num_covariates), # Group 1: Covariates
rep(2, num_diet), # Group 2: Diet
rep(3, num_chemicals) # Group 3: Chemicals
)
# model.matrix expands each factor into several dummy columns, so the
# per-variable counts above can fall short of ncol(x_train); assign any
# remaining columns to an extra group so the index vector matches
if (length(group_indices) < ncol(x_train)) {
group_indices <- c(group_indices, rep(4, ncol(x_train) - length(group_indices)))
}
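The padding above is needed because model.matrix expands each factor into several dummy columns, so a per-variable count undershoots the number of model columns. A small self-contained illustration with toy data (not from HELIX):

```r
# three variables, but model.matrix yields four columns: one for numeric age,
# two for sex (the first factor keeps all levels when there is no intercept),
# and one for year (later factors drop a reference level)
toy <- data.frame(
  age  = c(6, 7, 8, 9),
  sex  = factor(c("f", "m", "f", "m")),
  year = factor(c(2004, 2005, 2004, 2005))
)
X <- model.matrix(~ . - 1, toy)
colnames(X)
ncol(X)  # 4, one more than ncol(toy)
```

This is why group indices must be built per column rather than per variable when factors are present.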
set.seed(101)
group_lasso_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grLasso", penalty.factor = penalty_factors, family = "gaussian")
group_lasso_predictions <- predict(group_lasso_model, x_test, type = "response")
mse_group_lasso <- mean((y_test - group_lasso_predictions)^2)
rmse_group_lasso <- sqrt(mse_group_lasso)
cat("Group Lasso RMSE:", rmse_group_lasso, "\n")
## Group Lasso RMSE: 1.042529
set.seed(101)
cv_group_lasso <- cv.grpreg(x_train, y_train, group = group_indices, penalty = "grLasso", penalty.factor = penalty_factors, family = "gaussian", nfolds = 10)
cat("Optimal lambda:", cv_group_lasso$lambda.min, "\n")
## Optimal lambda: 0.02299869
coef_lasso <- coef(cv_group_lasso, s = "lambda.min")
sig_vars_lasso <- coef_lasso[coef_lasso != 0]
sig_vars_lasso <- sig_vars_lasso[names(sig_vars_lasso) != "(Intercept)"]
sig_vars_lasso_df <- data.frame(
Variable = names(sig_vars_lasso),
Coefficient = as.numeric(sig_vars_lasso)
)
sig_vars_lasso_df
group_lasso_predictions <- predict(cv_group_lasso, x_test, s = "lambda.min", type = "response")
mse_group_lasso <- mean((y_test - group_lasso_predictions)^2)
rmse_group_lasso <- sqrt(mse_group_lasso)
cat("Group LASSO RMSE with 10-fold CV:", rmse_group_lasso, "\n")
## Group LASSO RMSE with 10-fold CV: 1.031241
The group LASSO model produced a test RMSE of 1.043, improving to a cross-validated RMSE of 1.031, which indicates consistent performance across data subsets. Significant variables include e3_yearbir_None2009 (coefficient 0.349), suggesting that being born in 2009 is linked to a higher BMI Z-score. Higher copper levels (hs_cu_c_Log2, coefficient 0.478) are strongly associated with an increased BMI Z-score, while higher cadmium (hs_cd_c_Log2, coefficient -0.047) and MiBP (hs_mibp_cadj_Log2, coefficient -0.077) levels are linked to a decrease. Additionally, fruit consumption in the (7,14.1] tertile (hs_total_fruits_Ter(7,14.1]) is associated with a higher BMI Z-score (coefficient 0.059).
set.seed(101)
# grpreg has no group-ridge penalty; "grMCP" is the group minimax concave
# penalty (MCP), so the model is labeled accordingly
group_mcp_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grMCP", penalty.factor = penalty_factors, family = "gaussian")
group_mcp_predictions <- predict(group_mcp_model, x_test, type = "response")
mse_group_mcp <- mean((y_test - group_mcp_predictions)^2)
rmse_group_mcp <- sqrt(mse_group_mcp)
cat("Group MCP RMSE:", rmse_group_mcp, "\n")
## Group MCP RMSE: 1.042024
set.seed(101)
cv_group_mcp_model <- cv.grpreg(x_train, y_train, group = group_indices, penalty = "grMCP", penalty.factor = penalty_factors, family = "gaussian", nfolds = 10)
cat("Optimal lambda:", cv_group_mcp_model$lambda.min, "\n")
## Optimal lambda: 0.04019086
coef_mcp <- coef(cv_group_mcp_model, s = "lambda.min")
sig_vars_mcp <- coef_mcp[coef_mcp != 0]
sig_vars_mcp <- sig_vars_mcp[names(sig_vars_mcp) != "(Intercept)"]
sig_vars_mcp_df <- data.frame(
Variable = names(sig_vars_mcp),
Coefficient = as.numeric(sig_vars_mcp)
)
sig_vars_mcp_df
group_mcp_predictions <- predict(cv_group_mcp_model, x_test, s = "lambda.min", type = "response")
mse_group_mcp <- mean((y_test - group_mcp_predictions)^2)
rmse_group_mcp <- sqrt(mse_group_mcp)
cat("Group MCP RMSE with 10-fold CV:", rmse_group_mcp, "\n")
## Group MCP RMSE with 10-fold CV: 1.03338
The group MCP model produced a test RMSE of 1.042. After 10-fold cross-validation, the optimal lambda was 0.0402, giving an RMSE of 1.033. Significant variables included year of birth (e3_yearbir_None2009), whose positive coefficient of 0.465 indicates that being born in 2009 is associated with a higher BMI Z-score. Copper levels (hs_cu_c_Log2) showed a strong positive association (coefficient 0.606), as did mono-benzyl phthalate (hs_mbzp_cadj_Log2, coefficient 0.110). Conversely, a high intake of breakfast cereals (hs_break_cer_Ter(5.5, Inf]) was linked to a lower BMI Z-score (coefficient -0.115), and higher polychlorinated biphenyl-153 levels (hs_pcb153_cadj_Log2) were associated with a decrease in the outcome (coefficient -0.199).
set.seed(101)
# grpreg has no group elastic net; "grSCAD" is the group smoothly clipped
# absolute deviation (SCAD) penalty, so the model is labeled accordingly
group_scad_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grSCAD", penalty.factor = penalty_factors, family = "gaussian")
group_scad_predictions <- predict(group_scad_model, x_test, type = "response")
mse_group_scad <- mean((y_test - group_scad_predictions)^2)
rmse_group_scad <- sqrt(mse_group_scad)
cat("Group SCAD RMSE:", rmse_group_scad, "\n")
## Group SCAD RMSE: 1.043394
set.seed(101)
cv_group_scad_model <- cv.grpreg(x_train, y_train, group = group_indices, penalty = "grSCAD", penalty.factor = penalty_factors, family = "gaussian", nfolds = 10)
cat("Optimal lambda:", cv_group_scad_model$lambda.min, "\n")
## Optimal lambda: 0.03662041
coef_scad <- coef(cv_group_scad_model, s = "lambda.min")
sig_vars_scad <- coef_scad[coef_scad != 0]
sig_vars_scad <- sig_vars_scad[names(sig_vars_scad) != "(Intercept)"]
sig_vars_scad_df <- data.frame(
Variable = names(sig_vars_scad),
Coefficient = as.numeric(sig_vars_scad)
)
sig_vars_scad_df
group_scad_predictions <- predict(cv_group_scad_model, x_test, s = "lambda.min", type = "response")
mse_group_scad <- mean((y_test - group_scad_predictions)^2)
rmse_group_scad <- sqrt(mse_group_scad)
cat("Group SCAD RMSE with 10-fold CV:", rmse_group_scad, "\n")
## Group SCAD RMSE with 10-fold CV: 1.032168
The group SCAD model initially produced a test RMSE of 1.043. After 10-fold cross-validation, the optimal lambda was 0.037, giving an RMSE of 1.032. Significant variables included birth year 2009 (e3_yearbir_None2009, coefficient 0.267), associated with a higher BMI Z-score. Higher log-transformed copper levels (hs_cu_c_Log2) showed a strong positive association (coefficient 0.544), and the breastfeeding-duration category h_bfdur_Ter(10.8,34.9] was also positively associated with the outcome (coefficient 0.087). Conversely, higher log-transformed lead levels (hs_pb_c_Log2) were negatively associated with the outcome (coefficient -0.096), as was birth year 2004 (e3_yearbir_None2004, coefficient -0.125).
set.seed(101)
fit_tree <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")
rpart.plot(fit_tree)
tree_predictions <- predict(fit_tree, newdata = test_data)
mse_tree <- mean((test_data$hs_zbmi_who - tree_predictions)^2)
rmse_tree <- sqrt(mse_tree)
cat("Diet + Chemical + Covariates Decision Tree RMSE:", rmse_tree, "\n")
## Diet + Chemical + Covariates Decision Tree RMSE: 1.128749
printcp(fit_tree)
##
## Regression tree:
## rpart(formula = hs_zbmi_who ~ ., data = train_data, method = "anova")
##
## Variables actually used in tree construction:
## [1] hs_cu_c_Log2 hs_dde_cadj_Log2 hs_fastfood_Ter
## [4] hs_mnbp_cadj_Log2 hs_pbde153_cadj_Log2 hs_pcb170_cadj_Log2
## [7] hs_pfoa_c_Log2
##
## Root node error: 1328.8/913 = 1.4554
##
## n= 913
##
## CP nsplit rel error xerror xstd
## 1 0.077388 0 1.00000 1.00089 0.050861
## 2 0.032769 1 0.92261 0.93652 0.048377
## 3 0.032579 2 0.88984 0.91682 0.046750
## 4 0.025008 3 0.85726 0.91218 0.046570
## 5 0.015671 4 0.83226 0.90443 0.045712
## 6 0.013674 5 0.81658 0.91485 0.047377
## 7 0.013602 6 0.80291 0.91478 0.047609
## 8 0.013517 7 0.78931 0.91478 0.047609
## 9 0.011314 8 0.77579 0.93397 0.049086
## 10 0.011047 10 0.75317 0.96270 0.050245
## 11 0.010296 11 0.74212 0.98215 0.050293
## 12 0.010000 12 0.73182 0.98607 0.050369
plotcp(fit_tree)
optimal_cp <- fit_tree$cptable[which.min(fit_tree$cptable[,"xerror"]), "CP"]
pruned_tree <- prune(fit_tree, cp = optimal_cp)
rpart.plot(pruned_tree)
pruned_tree_predictions <- predict(pruned_tree, newdata = test_data)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)
cat("Pruned Diet + Chemical + Covariates Decision Tree RMSE:", rmse_pruned_tree, "\n")
## Pruned Diet + Chemical + Covariates Decision Tree RMSE: 1.089746
set.seed(101)
train_control <- trainControl(method = "cv", number = 10)
cp_grid <- expand.grid(cp = seq(0.001, 0.1, by = 0.001))
pruned_tree_model <- train(
hs_zbmi_who ~ .,
data = train_data,
method = "rpart",
trControl = train_control,
tuneGrid = cp_grid
)
best_cp <- pruned_tree_model$bestTune$cp
cat("Best cp value from cross-validation:", best_cp, "\n")
## Best cp value from cross-validation: 0.023
fit_tree_unpruned <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")
fit_tree_best <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova", control = rpart.control(cp = best_cp))
rpart.plot(fit_tree_unpruned)
rpart.plot(fit_tree_best)
unpruned_tree_predictions <- predict(fit_tree_unpruned, newdata = test_data)
pruned_tree_predictions <- predict(fit_tree_best, newdata = test_data)
mse_unpruned_tree <- mean((test_data$hs_zbmi_who - unpruned_tree_predictions)^2)
rmse_unpruned_tree <- sqrt(mse_unpruned_tree)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)
cat("Unpruned Decision Tree RMSE:", rmse_unpruned_tree, "\n")
## Unpruned Decision Tree RMSE: 1.128749
cat("Pruned Decision Tree RMSE with Best cp:", rmse_pruned_tree, "\n")
## Pruned Decision Tree RMSE with Best cp: 1.089746
The unpruned decision tree model for predicting BMI Z-scores in children had a test RMSE of 1.129. After pruning with the complexity parameter selected by 10-fold cross-validation, the test RMSE improved to 1.090, reflecting reduced overfitting: pruning trims branches whose splits do not lower cross-validated error, yielding a more generalized model. Key variables in the tree include log-transformed concentrations of PCB-170 (hs_pcb170_cadj_Log2) and PBDE-153 (hs_pbde153_cadj_Log2), which were crucial in splitting the data. The pruned tree also reveals the hierarchical influence of log-transformed DDE (dichlorodiphenyldichloroethylene), MnBP (mono-n-butyl phthalate), and PFOA (perfluorooctanoic acid) on BMI Z-score outcomes. The best cp value from cross-validation was 0.023, the point beyond which further pruning did not improve performance.
set.seed(101)
fit_rf <- randomForest(hs_zbmi_who ~ ., data = train_data, ntree = 500)
par(mar = c(6, 14, 4, 4) + 0.1)
varImpPlot(fit_rf, cex = 0.6)
importance(fit_rf)
## IncNodePurity
## hs_child_age_None 41.267546
## e3_sex_None 5.540866
## e3_yearbir_None 31.564485
## h_bfdur_Ter 13.136212
## hs_bakery_prod_Ter 20.689578
## hs_break_cer_Ter 13.679182
## hs_dairy_Ter 11.181729
## hs_fastfood_Ter 11.895794
## hs_org_food_Ter 10.856862
## hs_proc_meat_Ter 10.827847
## hs_total_fish_Ter 10.984827
## hs_total_fruits_Ter 13.977047
## hs_total_lipids_Ter 12.980031
## hs_total_sweets_Ter 10.749403
## hs_total_veg_Ter 12.258026
## hs_cd_c_Log2 44.886048
## hs_co_c_Log2 45.054544
## hs_cs_c_Log2 34.878871
## hs_cu_c_Log2 60.387604
## hs_hg_c_Log2 43.835480
## hs_mo_c_Log2 50.597285
## hs_pb_c_Log2 39.126962
## hs_dde_cadj_Log2 71.115165
## hs_pcb153_cadj_Log2 74.294671
## hs_pcb170_cadj_Log2 130.998985
## hs_dep_cadj_Log2 50.333164
## hs_pbde153_cadj_Log2 92.865989
## hs_pfhxs_c_Log2 38.593526
## hs_pfoa_c_Log2 53.392093
## hs_pfos_c_Log2 44.360304
## hs_prpa_cadj_Log2 43.848750
## hs_mbzp_cadj_Log2 49.661827
## hs_mibp_cadj_Log2 38.399318
## hs_mnbp_cadj_Log2 45.955350
rf_predictions <- predict(fit_rf, newdata = test_data)
mse_rf <- mean((y_test - rf_predictions)^2)
rmse_rf <- sqrt(mse_rf)
cat("Diet + Chemical + Covariates Random Forest RMSE:", rmse_rf, "\n")
## Diet + Chemical + Covariates Random Forest RMSE: 1.024848
rf_model <- train(
x_train, y_train,
method = "rf",
trControl = train_control,
tuneLength = 10
)
rf_predictions_cv <- predict(rf_model, x_test)
mse_rf_cv <- mean((y_test - rf_predictions_cv)^2)
rmse_rf_cv <- sqrt(mse_rf_cv)
cat("10-Fold CV Random Forest RMSE:", rmse_rf_cv, "\n")
## 10-Fold CV Random Forest RMSE: 1.025544
The random forest model yielded a test RMSE of 1.025 and a cross-validated RMSE of 1.026, demonstrating robust performance. The variables contributing most to this performance are polychlorinated biphenyl-170 (hs_pcb170_cadj_Log2), polybrominated diphenyl ether-153 (hs_pbde153_cadj_Log2), dichlorodiphenyldichloroethylene (hs_dde_cadj_Log2), copper (hs_cu_c_Log2), and perfluorooctanoic acid (hs_pfoa_c_Log2), which had the highest importance (IncNodePurity) scores and drove most of the reduction in prediction error.
set.seed(101)
fit_gbm <- gbm(hs_zbmi_who ~ ., data = train_data, distribution = "gaussian", n.trees = 100, interaction.depth = 3, shrinkage = 0.01, cv.folds = 5)
summary(fit_gbm)
best_iter <- gbm.perf(fit_gbm, method = "cv")
gbm_predictions <- predict(fit_gbm, newdata = test_data, n.trees = best_iter)
mse_gbm <- mean((y_test - gbm_predictions)^2)
rmse_gbm <- sqrt(mse_gbm)
cat("Diet + Chemical + Covariates GBM RMSE:", rmse_gbm, "\n")
## Diet + Chemical + Covariates GBM RMSE: 1.072432
gbm_model <- train(
x_train, y_train,
method = "gbm",
trControl = train_control,
tuneLength = 10,
verbose = FALSE
)
gbm_predictions_cv <- predict(gbm_model, x_test)
mse_gbm_cv <- mean((y_test - gbm_predictions_cv)^2)
rmse_gbm_cv <- sqrt(mse_gbm_cv)
cat("10-Fold CV GBM RMSE:", rmse_gbm_cv, "\n")
## 10-Fold CV GBM RMSE: 1.048183
The GBM model has a test RMSE of 1.072 and a cross-validated RMSE of 1.048, showing that tuning improved generalization. The variables with the highest relative influence are polychlorinated biphenyl-170 (hs_pcb170_cadj_Log2), polybrominated diphenyl ether-153 (hs_pbde153_cadj_Log2), dichlorodiphenyldichloroethylene (hs_dde_cadj_Log2), perfluorooctanoic acid (hs_pfoa_c_Log2), and copper (hs_cu_c_Log2), underscoring their importance to the model’s predictive capability.
Among these models, the random forest demonstrated the best overall performance with a cross-validated RMSE of 1.026, ahead of the GBM’s cross-validated RMSE of 1.048. This suggests that, for these data and tuning settings, averaging many decorrelated trees captured the relevant relationships more effectively than boosting.
Another notable observation is the consistent importance of certain variables across all models. Variables such as PCBs (hs_pcb170_cadj_Log2), PBDE-153 (hs_pbde153_cadj_Log2), DDE (hs_dde_cadj_Log2), copper (hs_cu_c_Log2), and PFOA (hs_pfoa_c_Log2) consistently emerged as significant predictors, highlighting their strong influence on childhood BMI Z-scores.
The decision tree model’s pruning and cross-validation reduced overfitting and improved performance, as shown by the decrease in test RMSE from 1.129 to 1.090. This underscores the value of model regularization in enhancing predictive accuracy.
The comparison among the group LASSO, group MCP (grMCP), and group SCAD (grSCAD) models highlights the effectiveness of group-penalized regression for structured, high-dimensional predictors. The group SCAD model, with a cross-validated RMSE of 1.032, retained the significant predictors while shrinking weak groups toward zero.
Overall, each model revealed significant predictors of BMI Z-scores, with environmental chemicals, dietary factors, and demographic covariates playing critical roles in influencing childhood BMI.
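As with the previous section, the reported RMSEs for the diet + chemical + covariate models can be collected into one table for comparison (values copied from the output above):

```r
# 10-fold CV RMSEs, except the pruned tree, which is a test-set RMSE
rmse_summary <- data.frame(
  Model = c("Group LASSO (grLasso)", "Group MCP (grMCP)", "Group SCAD (grSCAD)",
            "Pruned Decision Tree", "Random Forest", "GBM"),
  RMSE  = c(1.031241, 1.03338, 1.032168, 1.089746, 1.025544, 1.048183)
)
rmse_summary[order(rmse_summary$RMSE), ]  # random forest ranks first
```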
load("/Users/allison/Library/CloudStorage/GoogleDrive-aflouie@usc.edu/My Drive/HELIX_data/metabol_serum.RData")
metabol_serum_transposed <- as.data.frame(t(metabol_serum.d))
metabol_serum_transposed$ID <- as.integer(rownames(metabol_serum_transposed))
# add the ID column to the first position
metabol_serum_transposed <- metabol_serum_transposed[, c("ID", setdiff(names(metabol_serum_transposed), "ID"))] # ID is the first column, and the layout is preserved
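The transpose-and-relabel step above can be illustrated on a toy matrix with hypothetical IDs and metabolite names (the real metabol_serum.d stores metabolites in rows and children in columns):

```r
# metabolites stored as rows, children as columns
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("metab_1", "metab_2"), c("101", "102", "103")))
d <- as.data.frame(t(m))         # now one row per child
d$ID <- as.integer(rownames(d))  # recover child IDs from the row names
d <- d[, c("ID", setdiff(names(d), "ID"))]  # move ID to the front
d
```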
# specific covariates
load("/Users/allison/Library/CloudStorage/GoogleDrive-aflouie@usc.edu/My Drive/HELIX_data/HELIX.RData")
filtered_chem_diet <- codebook %>%
filter(domain %in% c("Chemicals", "Lifestyles") & period == "Postnatal" & subfamily != "Allergens")
# specific covariates
filtered_covariates <- codebook %>%
filter(domain == "Covariates" &
variable_name %in% c("ID", "e3_sex_None", "e3_yearbir_None", "hs_child_age_None"))
#specific phenotype variables
filtered_phenotype <- codebook %>%
filter(domain == "Phenotype" &
variable_name %in% c("hs_zbmi_who"))
# combining all necessary variables together
combined_codebook <- bind_rows(filtered_chem_diet, filtered_covariates, filtered_phenotype)
outcome_and_cov <- cbind(covariates, outcome_BMI)
outcome_and_cov <- outcome_and_cov[, !duplicated(colnames(outcome_and_cov))]
outcome_and_cov <- outcome_and_cov %>%
dplyr::select(ID, hs_child_age_None, e3_sex_None, e3_yearbir_None, hs_zbmi_who)
#the full chemicals list
chemicals_specific <- c(
"hs_cd_c_Log2",
"hs_co_c_Log2",
"hs_cs_c_Log2",
"hs_cu_c_Log2",
"hs_hg_c_Log2",
"hs_mo_c_Log2",
"hs_pb_c_Log2",
"hs_dde_cadj_Log2",
"hs_pcb153_cadj_Log2",
"hs_pcb170_cadj_Log2",
"hs_dep_cadj_Log2",
"hs_pbde153_cadj_Log2",
"hs_pfhxs_c_Log2",
"hs_pfoa_c_Log2",
"hs_pfos_c_Log2",
"hs_prpa_cadj_Log2",
"hs_mbzp_cadj_Log2",
"hs_mibp_cadj_Log2",
"hs_mnbp_cadj_Log2"
)
#postnatal diet for child
postnatal_diet <- c(
"h_bfdur_Ter",
"hs_bakery_prod_Ter",
"hs_dairy_Ter",
"hs_fastfood_Ter",
"hs_org_food_Ter",
"hs_readymade_Ter",
"hs_total_bread_Ter",
"hs_total_fish_Ter",
"hs_total_fruits_Ter",
"hs_total_lipids_Ter",
"hs_total_potatoes_Ter",
"hs_total_sweets_Ter",
"hs_total_veg_Ter"
)
covariates_selected <- c("hs_child_age_None", "e3_sex_None", "e3_yearbir_None")
all_columns <- c(chemicals_specific, postnatal_diet)
extracted_exposome <- exposome %>% dplyr::select(all_of(all_columns))
selected_id_data <- cbind(outcome_and_cov, extracted_exposome)
# ID is the common identifier in both datasets
combined_data <- merge(selected_id_data, metabol_serum_transposed, by = "ID", all = TRUE)
selected_metabolomics_data <- combined_data %>% dplyr::select(-c(ID))
selected_metabolomics_data <- selected_metabolomics_data %>% na.omit()
set.seed(101)
trainIndex <- createDataPartition(selected_metabolomics_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
metabol_train_data <- selected_metabolomics_data[trainIndex,]
metabol_test_data <- selected_metabolomics_data[-trainIndex,]
x_full_train <- model.matrix(hs_zbmi_who ~ . -1, metabol_train_data)
y_full_train <- metabol_train_data$hs_zbmi_who
x_full_test <- model.matrix(hs_zbmi_who ~ . -1, metabol_test_data)
y_full_test <- metabol_test_data$hs_zbmi_who
train_control <- trainControl(method = "cv", number = 10)
penalty_factors <- rep(1, ncol(x_full_train))
penalty_factors[colnames(x_full_train) %in% covariates_selected] <- 0
num_covariates <- length(covariates_selected)
num_diet <- length(postnatal_diet)
num_chemicals <- length(chemicals_specific)
num_metabolomics <- ncol(metabol_serum_transposed) - 1 # subtract for the ID column
group_indices <- c(
rep(1, num_covariates), # Group 1: Covariates
rep(2, num_diet), # Group 2: Diet
rep(3, num_chemicals), # Group 3: Chemicals
rep(4, num_metabolomics) # Group 4: Metabolomics
)
if (length(group_indices) < ncol(x_full_train)) {
group_indices <- c(group_indices, rep(5, ncol(x_full_train) - length(group_indices)))
}
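For reference, grpreg's documented mechanism for leaving covariates unpenalized is to code them as group 0, rather than a glmnet-style penalty factor. A minimal sketch of that coding, using a hypothetical `make_groups` helper with the group sizes above (ignoring the extra dummy columns that `model.matrix` creates for the tertile factors):

```r
# grpreg leaves group-0 predictors unpenalized; coding the covariates as
# group 0 is the idiomatic replacement for a per-variable penalty factor.
make_groups <- function(n_cov, n_diet, n_chem, n_metab) {
  c(rep(0, n_cov),   # covariates: unpenalized
    rep(1, n_diet),  # diet
    rep(2, n_chem),  # chemicals
    rep(3, n_metab)) # metabolomics
}

g <- make_groups(3, 13, 19, 177)
table(g)
```

The resulting vector would then be passed as the `group` argument to `grpreg()` in place of `group_indices`.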
#length(group_indices) == ncol(x_full_train) # should be true
# fit group lasso model
set.seed(101)
group_lasso_model <- grpreg(x_full_train, y_full_train, group = group_indices, penalty = "grLasso", penalty.factor = penalty_factors, family = "gaussian")
group_lasso_predictions <- predict(group_lasso_model, x_full_test, type = "response")
# Note: without a lambda argument, predict.grpreg() returns predictions over the
# entire lambda path, so the RMSE below averages across all path values; the
# same applies to the grMCP and grSCAD fits further down.
mse_group_lasso <- mean((y_full_test - group_lasso_predictions)^2)
rmse_group_lasso <- sqrt(mse_group_lasso)
cat("Group Lasso Test RMSE:", rmse_group_lasso, "\n")
## Group Lasso Test RMSE: 0.9334466
set.seed(101)
cv_group_lasso <- cv.grpreg(x_full_train, y_full_train, group = group_indices, penalty = "grLasso", penalty.factor = penalty_factors, family = "gaussian", nfolds = 10)
cat("Optimal lambda:", cv_group_lasso$lambda.min, "\n")
## Optimal lambda: 0.01400055
coef_lasso <- coef(cv_group_lasso, lambda = cv_group_lasso$lambda.min)
sig_vars_lasso <- coef_lasso[coef_lasso != 0]
sig_vars_lasso <- sig_vars_lasso[names(sig_vars_lasso) != "(Intercept)"]
sig_vars_lasso_df <- data.frame(
Variable = names(sig_vars_lasso),
Coefficient = as.numeric(sig_vars_lasso)
)
sig_vars_lasso_df
group_lasso_predictions <- predict(cv_group_lasso, x_full_test, lambda = cv_group_lasso$lambda.min, type = "response")
mse_group_lasso <- mean((y_full_test - group_lasso_predictions)^2)
rmse_group_lasso <- sqrt(mse_group_lasso)
cat("Group LASSO RMSE with 10-fold CV:", rmse_group_lasso, "\n")
## Group LASSO RMSE with 10-fold CV: 0.8748226
For predicting BMI Z-score from diet, chemical exposures, metabolomics, and covariates, the group LASSO model produced a test-set RMSE of 0.9334, improving to 0.8748 at the 10-fold cross-validated optimum (lambda = 0.014). Several metabolites carried large coefficients. Metabolite 161 had the largest positive coefficient (2.8164), followed by metabolites 95 (1.3657), 88 (0.8329), 59 (0.7794), and 124 (0.7499), suggesting that higher levels of these metabolites are associated with higher BMI Z-scores. Conversely, metabolite 160 had the largest negative coefficient (-3.2241), followed by metabolites 89 (-1.8773), 82 (-1.4959), 138 (-1.0762), and 125 (-1.0322), indicating that higher levels of these metabolites are associated with lower BMI Z-scores.
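The coefficient ranking reported above can be extracted programmatically. A minimal sketch with a hypothetical `top_effects` helper; the example coefficients are illustrative stand-ins, not the fitted values:

```r
# Split a named coefficient vector into the largest positive and largest
# negative effects, dropping the intercept and zeroed-out coefficients.
top_effects <- function(coefs, n = 5) {
  coefs <- coefs[names(coefs) != "(Intercept)" & coefs != 0]
  list(
    positive = head(sort(coefs[coefs > 0], decreasing = TRUE), n),
    negative = head(sort(coefs[coefs < 0]), n)  # most negative first
  )
}

# Illustrative values only, not the fitted group LASSO coefficients.
example <- c("(Intercept)" = 0.1, metab_161 = 2.82, metab_95 = 1.37,
             metab_160 = -3.22, metab_89 = -1.88, metab_88 = 0.83)
top_effects(example, n = 2)
```

Applied to `coef_lasso` above, this reproduces the positive/negative rankings discussed in the text.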
set.seed(101)
group_ridge_model <- grpreg(x_full_train, y_full_train, group = group_indices, penalty = "grMCP", penalty.factor = penalty_factors, family = "gaussian")
group_ridge_predictions <- predict(group_ridge_model, x_full_test, type = "response")
mse_group_ridge <- mean((y_full_test - group_ridge_predictions)^2)
rmse_group_ridge <- sqrt(mse_group_ridge)
cat("Group Ridge RMSE:", rmse_group_ridge, "\n")
## Group Ridge RMSE: 0.9424999
set.seed(101)
cv_group_ridge_model <- cv.grpreg(x_full_train, y_full_train, group = group_indices, penalty = "grMCP", penalty.factor = penalty_factors, family = "gaussian", nfolds = 10)
cat("Optimal lambda:", cv_group_ridge_model$lambda.min, "\n")
## Optimal lambda: 0.02446636
coef_ridge <- coef(cv_group_ridge_model, lambda = cv_group_ridge_model$lambda.min)
sig_vars_ridge <- coef_ridge[coef_ridge != 0]
sig_vars_ridge <- sig_vars_ridge[names(sig_vars_ridge) != "(Intercept)"]
sig_vars_ridge_df <- data.frame(
Variable = names(sig_vars_ridge),
Coefficient = as.numeric(sig_vars_ridge)
)
sig_vars_ridge_df
group_ridge_predictions <- predict(cv_group_ridge_model, x_full_test, lambda = cv_group_ridge_model$lambda.min, type = "response")
mse_group_ridge <- mean((y_full_test - group_ridge_predictions)^2)
rmse_group_ridge <- sqrt(mse_group_ridge)
cat("Group Ridge RMSE with 10-fold CV:", rmse_group_ridge, "\n")
## Group Ridge RMSE with 10-fold CV: 0.8849928
The group ridge model (fit here with grpreg's grMCP penalty, i.e. group MCP rather than a literal ridge penalty) reached its cross-validated optimum at lambda = 0.024, balancing model complexity and fit. The top five positive coefficients were all metabolomic variables (metab_161, metab_95, metab_90, metab_59, and metab_177), suggesting positive associations with BMI Z-score, while the top five negative coefficients (metab_160, metab_89, metab_82, metab_135, and metab_138) indicate negative associations. The test-set RMSE was 0.942, improving to 0.885 with 10-fold cross-validation, demonstrating good predictive performance and suggesting that these variables capture substantial variance in BMI Z-score.
set.seed(101)
group_enet_model <- grpreg(x_full_train, y_full_train, group = group_indices, penalty = "grSCAD", penalty.factor = penalty_factors, family = "gaussian")
group_enet_predictions <- predict(group_enet_model, x_full_test, type = "response")
mse_group_enet <- mean((y_full_test - group_enet_predictions)^2)
rmse_group_enet <- sqrt(mse_group_enet)
cat("Group Elastic Net RMSE:", rmse_group_enet, "\n")
## Group Elastic Net RMSE: 0.9440745
set.seed(101)
cv_group_enet_model <- cv.grpreg(x_full_train, y_full_train, group = group_indices, penalty = "grSCAD", penalty.factor = penalty_factors, family = "gaussian", nfolds = 10)
cat("Optimal lambda:", cv_group_enet_model$lambda.min, "\n")
## Optimal lambda: 0.0185079
coef_enet <- coef(cv_group_enet_model, lambda = cv_group_enet_model$lambda.min)
sig_vars_enet <- coef_enet[coef_enet != 0]
sig_vars_enet <- sig_vars_enet[names(sig_vars_enet) != "(Intercept)"]
sig_vars_enet_df <- data.frame(
Variable = names(sig_vars_enet),
Coefficient = as.numeric(sig_vars_enet)
)
sig_vars_enet_df
group_enet_predictions <- predict(cv_group_enet_model, x_full_test, lambda = cv_group_enet_model$lambda.min, type = "response")
mse_group_enet <- mean((y_full_test - group_enet_predictions)^2)
rmse_group_enet <- sqrt(mse_group_enet)
cat("Group Elastic Net RMSE with 10-fold CV:", rmse_group_enet, "\n")
## Group Elastic Net RMSE with 10-fold CV: 0.8844928
The group elastic net model (fit here with grpreg's grSCAD penalty, i.e. group SCAD rather than a literal elastic net) gave a test-set RMSE of 0.9441, improving to 0.8845 with 10-fold cross-validation, with an optimal lambda of 0.0185. The strongest positive associations were metab_161 (coefficient 3.6039), metab_95, metab_90, metab_59, and metab_177; the strongest negative associations were metab_160 (-4.0738), metab_89, metab_82, metab_135, and metab_138. These metabolites show substantial associations with BMI Z-score in either direction, which could provide insight into the biological pathways influencing BMI in children.
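Since the three grpreg fits differ only in their `penalty` argument, their held-out errors can be compared with a shared helper. A sketch with a small `rmse` function; the prediction vectors below are toy values standing in for the model outputs:

```r
# Held-out root mean squared error, as computed repeatedly above.
rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))

# Toy observed values and toy predictions for each penalty (illustrative only;
# in the analysis these would be the predict() outputs on x_full_test).
y_obs <- c(1.0, 0.5, -0.2, 0.8)
preds <- list(grLasso = c(0.9, 0.6, -0.1, 0.7),
              grMCP   = c(1.2, 0.4, -0.3, 0.9),
              grSCAD  = c(1.1, 0.5, -0.2, 0.6))
sapply(preds, function(p) rmse(y_obs, p))
```

Factoring the RMSE calculation out like this avoids repeating the same three lines after every model fit.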
selected_metabolomics_data <- selected_metabolomics_data %>% na.omit()
set.seed(101)
trainIndex <- createDataPartition(selected_metabolomics_data$hs_zbmi_who, p = .7,
list = FALSE,
times = 1)
train_data <- selected_metabolomics_data[ trainIndex,]
test_data <- selected_metabolomics_data[-trainIndex,]
x_full_train <- model.matrix(hs_zbmi_who ~ ., train_data)[,-1]
y_full_train <- train_data$hs_zbmi_who
x_full_test <- model.matrix(hs_zbmi_who ~ ., test_data)[,-1]
y_full_test <- test_data$hs_zbmi_who
set.seed(101)
fit_tree <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")
rpart.plot(fit_tree)
tree_predictions <- predict(fit_tree, newdata = test_data)
mse_tree <- mean((test_data$hs_zbmi_who - tree_predictions)^2)
rmse_tree <- sqrt(mse_tree)
cat("Full Decision Tree RMSE:", rmse_tree, "\n")
## Full Decision Tree RMSE: 1.243108
printcp(fit_tree)
##
## Regression tree:
## rpart(formula = hs_zbmi_who ~ ., data = train_data, method = "anova")
##
## Variables actually used in tree construction:
## [1] hs_cu_c_Log2 hs_dde_cadj_Log2 hs_pcb170_cadj_Log2
## [4] metab_100 metab_129 metab_140
## [7] metab_141 metab_142 metab_157
## [10] metab_161 metab_176 metab_47
## [13] metab_58 metab_8 metab_94
## [16] metab_95
##
## Root node error: 1254/841 = 1.4911
##
## n= 841
##
## CP nsplit rel error xerror xstd
## 1 0.089116 0 1.00000 1.00275 0.052776
## 2 0.069078 1 0.91088 0.94590 0.050339
## 3 0.037249 2 0.84181 0.91981 0.049729
## 4 0.034324 3 0.80456 0.90111 0.048711
## 5 0.025576 4 0.77023 0.89055 0.049604
## 6 0.021936 5 0.74466 0.90166 0.048993
## 7 0.021276 6 0.72272 0.90895 0.049515
## 8 0.020244 8 0.68017 0.90807 0.050310
## 9 0.016938 9 0.65992 0.91277 0.050356
## 10 0.015553 10 0.64299 0.90306 0.050272
## 11 0.015178 11 0.62743 0.90122 0.050368
## 12 0.014756 12 0.61226 0.90066 0.050300
## 13 0.014022 13 0.59750 0.91203 0.051412
## 14 0.013859 14 0.58348 0.92227 0.051732
## 15 0.012010 15 0.56962 0.93488 0.053622
## 16 0.011088 16 0.55761 0.93327 0.053609
## 17 0.011027 17 0.54652 0.92988 0.053560
## 18 0.010000 18 0.53549 0.93467 0.053891
plotcp(fit_tree)
optimal_cp <- fit_tree$cptable[which.min(fit_tree$cptable[,"xerror"]), "CP"]
pruned_tree <- prune(fit_tree, cp = optimal_cp)
rpart.plot(pruned_tree)
pruned_tree_predictions <- predict(pruned_tree, newdata = test_data)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)
cat("Pruned Decision Tree RMSE:", rmse_pruned_tree, "\n")
## Pruned Decision Tree RMSE: 1.12821
set.seed(101)
train_control <- trainControl(method = "cv", number = 10)
cp_grid <- expand.grid(cp = seq(0.001, 0.1, by = 0.001))
pruned_tree_model <- train(
hs_zbmi_who ~ .,
data = train_data,
method = "rpart",
trControl = train_control,
tuneGrid = cp_grid
)
best_cp <- pruned_tree_model$bestTune$cp
cat("Best cp value from cross-validation:", best_cp, "\n")
## Best cp value from cross-validation: 0.036
fit_tree_unpruned <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")
fit_tree_best <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova", control = rpart.control(cp = best_cp))
rpart.plot(fit_tree_unpruned)
rpart.plot(fit_tree_best)
unpruned_tree_predictions <- predict(fit_tree_unpruned, newdata = test_data)
pruned_tree_predictions <- predict(fit_tree_best, newdata = test_data)
mse_unpruned_tree <- mean((test_data$hs_zbmi_who - unpruned_tree_predictions)^2)
rmse_unpruned_tree <- sqrt(mse_unpruned_tree)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)
cat("Unpruned Decision Tree RMSE:", rmse_unpruned_tree, "\n")
## Unpruned Decision Tree RMSE: 1.243108
cat("Pruned Decision Tree RMSE with Best cp:", rmse_pruned_tree, "\n")
## Pruned Decision Tree RMSE with Best cp: 1.127666
The key variables identified include polychlorinated biphenyl-170 (hs_pcb170_cadj_Log2) and metabolomic variables such as metab_95, metab_129, metab_161, metab_140, and metab_176. The unpruned tree had an RMSE of 1.243, while the tree refit with the cross-validated optimal cp had an RMSE of 1.128, indicating that pruning reduced overfitting and improved performance on the test data.
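The cp chosen above minimizes the cross-validated error (xerror); an alternative often used with rpart is the one-standard-error rule, which prefers the simplest tree whose error is within one SE of the minimum. A sketch against a cptable-shaped matrix, using toy values and a hypothetical `cp_one_se` helper:

```r
# One-SE rule on an rpart-style cptable: pick the smallest tree whose
# cross-validated error is within one standard error of the minimum.
cp_one_se <- function(cptable) {
  i_min  <- which.min(cptable[, "xerror"])
  cutoff <- cptable[i_min, "xerror"] + cptable[i_min, "xstd"]
  i_best <- which(cptable[, "xerror"] <= cutoff)[1]  # fewest splits first
  cptable[i_best, "CP"]
}

# Toy cptable (illustrative values shaped like printcp() output).
toy <- cbind(CP = c(0.089, 0.069, 0.037, 0.034),
             nsplit = 0:3,
             xerror = c(1.003, 0.946, 0.920, 0.901),
             xstd = c(0.053, 0.050, 0.050, 0.049))
cp_one_se(toy)
```

With real output this would be called as `cp_one_se(fit_tree$cptable)` and typically yields a slightly larger cp (a smaller tree) than the minimum-xerror choice.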
set.seed(101)
rf_model <- randomForest(hs_zbmi_who ~ . , data = train_data, ntree = 500)
importance(rf_model)
## IncNodePurity
## hs_child_age_None 4.7927709
## e3_sex_None 0.5707810
## e3_yearbir_None 5.3687139
## hs_cd_c_Log2 4.8057550
## hs_co_c_Log2 5.5171622
## hs_cs_c_Log2 5.0356667
## hs_cu_c_Log2 12.4603237
## hs_hg_c_Log2 6.5609155
## hs_mo_c_Log2 9.6711331
## hs_pb_c_Log2 6.0711093
## hs_dde_cadj_Log2 13.2357957
## hs_pcb153_cadj_Log2 42.5218764
## hs_pcb170_cadj_Log2 78.6536355
## hs_dep_cadj_Log2 5.7451267
## hs_pbde153_cadj_Log2 28.0252092
## hs_pfhxs_c_Log2 5.4200119
## hs_pfoa_c_Log2 9.0245854
## hs_pfos_c_Log2 6.9947002
## hs_prpa_cadj_Log2 5.0401167
## hs_mbzp_cadj_Log2 4.4169868
## hs_mibp_cadj_Log2 4.0028339
## hs_mnbp_cadj_Log2 5.0996136
## h_bfdur_Ter 2.8968599
## hs_bakery_prod_Ter 2.9013737
## hs_dairy_Ter 1.2216138
## hs_fastfood_Ter 0.9580573
## hs_org_food_Ter 1.1570809
## hs_readymade_Ter 1.4572844
## hs_total_bread_Ter 1.1592172
## hs_total_fish_Ter 1.4961459
## hs_total_fruits_Ter 1.1767595
## hs_total_lipids_Ter 1.3061318
## hs_total_potatoes_Ter 1.2547008
## hs_total_sweets_Ter 0.9540810
## hs_total_veg_Ter 1.0142937
## metab_1 4.5011841
## metab_2 4.1531041
## metab_3 3.4379265
## metab_4 5.7831834
## metab_5 2.9178829
## metab_6 9.2084145
## metab_7 4.5028527
## metab_8 36.1613265
## metab_9 2.8921966
## metab_10 3.3253911
## metab_11 4.0194203
## metab_12 3.3756205
## metab_13 4.1227190
## metab_14 5.2837840
## metab_15 4.4314031
## metab_16 2.6462109
## metab_17 2.4025892
## metab_18 3.0563512
## metab_19 2.2951339
## metab_20 3.9067077
## metab_21 2.8607845
## metab_22 2.6869075
## metab_23 2.8972758
## metab_24 3.7219939
## metab_25 3.4797174
## metab_26 7.4516177
## metab_27 3.3263961
## metab_28 3.6858414
## metab_29 3.3029111
## metab_30 18.5312981
## metab_31 3.4436727
## metab_32 3.0545040
## metab_33 5.2038360
## metab_34 2.4118531
## metab_35 7.7143364
## metab_36 3.8489845
## metab_37 3.5293186
## metab_38 2.5404708
## metab_39 2.5165610
## metab_40 5.1546331
## metab_41 3.8921235
## metab_42 6.1284140
## metab_43 3.4008043
## metab_44 3.5698540
## metab_45 3.6758543
## metab_46 5.1802108
## metab_47 6.2120208
## metab_48 11.7806465
## metab_49 33.0932505
## metab_50 9.9343588
## metab_51 6.2231737
## metab_52 3.3042007
## metab_53 5.1286048
## metab_54 4.9482602
## metab_55 7.9980726
## metab_56 4.6904911
## metab_57 4.7919533
## metab_58 3.2769554
## metab_59 5.7319035
## metab_60 3.9192335
## metab_61 3.0283668
## metab_62 4.3267745
## metab_63 4.8698592
## metab_64 3.9307501
## metab_65 3.8785658
## metab_66 2.5032525
## metab_67 2.7330496
## metab_68 3.5029673
## metab_69 2.4526201
## metab_70 3.2796735
## metab_71 4.6222410
## metab_72 3.4044271
## metab_73 3.0460977
## metab_74 2.7018733
## metab_75 3.8889307
## metab_76 2.8113720
## metab_77 4.7832198
## metab_78 4.9096271
## metab_79 4.0912870
## metab_80 3.5334220
## metab_81 3.3897748
## metab_82 4.6950398
## metab_83 3.3638699
## metab_84 3.0885189
## metab_85 4.6423208
## metab_86 3.2104952
## metab_87 3.8282082
## metab_88 2.8852207
## metab_89 2.7940235
## metab_90 2.6775937
## metab_91 3.2999703
## metab_92 3.4644294
## metab_93 2.8208312
## metab_94 8.2011811
## metab_95 51.6744901
## metab_96 8.1118854
## metab_97 3.0816526
## metab_98 3.2655000
## metab_99 4.4506345
## metab_100 3.9391564
## metab_101 3.6374127
## metab_102 5.7070910
## metab_103 4.1270357
## metab_104 5.6319496
## metab_105 3.3000728
## metab_106 3.4540736
## metab_107 3.8068032
## metab_108 4.0689264
## metab_109 5.3846332
## metab_110 7.8695016
## metab_111 2.7218197
## metab_112 2.7613608
## metab_113 5.5357192
## metab_114 4.3740721
## metab_115 4.3476546
## metab_116 6.1735729
## metab_117 7.0544259
## metab_118 2.4601605
## metab_119 5.4450427
## metab_120 7.1242577
## metab_121 3.7051871
## metab_122 6.2282665
## metab_123 3.2628782
## metab_124 3.3382452
## metab_125 3.4201616
## metab_126 2.4057902
## metab_127 5.0861475
## metab_128 6.4178268
## metab_129 4.2203870
## metab_130 3.4512521
## metab_131 3.2548319
## metab_132 3.2654363
## metab_133 2.6708217
## metab_134 3.2328840
## metab_135 5.2587159
## metab_136 5.0406508
## metab_137 6.3400908
## metab_138 4.7455979
## metab_139 3.7303384
## metab_140 2.7325221
## metab_141 7.9353350
## metab_142 13.5320442
## metab_143 9.6865623
## metab_144 3.9033383
## metab_145 4.2785181
## metab_146 4.2770667
## metab_147 3.5426166
## metab_148 3.3025573
## metab_149 5.6260285
## metab_150 4.9789354
## metab_151 3.5939464
## metab_152 4.7580724
## metab_153 4.3991465
## metab_154 5.0953881
## metab_155 2.5069093
## metab_156 2.8819060
## metab_157 4.3767383
## metab_158 3.9225685
## metab_159 3.3643529
## metab_160 7.0114089
## metab_161 28.1188870
## metab_162 3.9630717
## metab_163 14.4399347
## metab_164 7.1747718
## metab_165 3.9310471
## metab_166 3.8429719
## metab_167 3.3589423
## metab_168 3.1719266
## metab_169 4.2720474
## metab_170 4.2149550
## metab_171 4.9419993
## metab_172 3.7296383
## metab_173 4.0956131
## metab_174 4.6193564
## metab_175 3.7661891
## metab_176 6.2453467
## metab_177 14.2757880
par(mar = c(6, 14, 4, 4) + 0.1)
varImpPlot(rf_model, cex = 0.6)
rf_predictions <- predict(rf_model, newdata = test_data)
rf_mse <- mean((rf_predictions - test_data$hs_zbmi_who)^2)
rmse_rf <- sqrt(rf_mse)
cat("Diet + Chemical + Metabolomic + Covariates Random Forest RMSE:", rmse_rf, "\n")
## Diet + Chemical + Metabolomic + Covariates Random Forest RMSE: 1.247886
rf_model <- train(
x_full_train, y_full_train,
method = "rf",
trControl = train_control,
tuneLength = 10
)
rf_predictions_cv <- predict(rf_model, x_full_test)
mse_rf_cv <- mean((y_full_test - rf_predictions_cv)^2)
rmse_rf_cv <- sqrt(mse_rf_cv)
cat("10-Fold CV Random Forest RMSE:", rmse_rf_cv, "\n")
## 10-Fold CV Random Forest RMSE: 1.004854
The IncNodePurity values indicate that hs_pcb170_cadj_Log2 is the most influential variable, followed by metab_95, hs_pcb153_cadj_Log2, and metab_8. The default random forest had a test-set RMSE of 1.248, which improved to 1.005 after tuning mtry with 10-fold cross-validation, indicating that tuning meaningfully reduced error while maintaining consistent performance across folds.
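The importance ranking quoted above can be pulled directly from the matrix returned by `importance()`. A sketch with a hypothetical `top_importance` helper; the toy values mirror the reported ranking rather than reproducing the full output:

```r
# Return the n predictors with the highest IncNodePurity from a
# randomForest-style importance matrix (rownames = variables).
top_importance <- function(imp, n = 5) {
  v <- imp[, "IncNodePurity"]
  head(sort(v, decreasing = TRUE), n)
}

# Toy importance matrix (values copied from the top of the report above).
toy_imp <- matrix(c(78.65, 51.67, 42.52, 36.16, 12.46),
                  dimnames = list(c("hs_pcb170_cadj_Log2", "metab_95",
                                    "hs_pcb153_cadj_Log2", "metab_8",
                                    "hs_cu_c_Log2"), "IncNodePurity"))
top_importance(toy_imp, n = 3)
```

Called as `top_importance(importance(rf_model), 10)`, this gives the same ordering that varImpPlot displays graphically.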
gbm_model_full <- gbm(hs_zbmi_who ~ ., data = train_data,
distribution = "gaussian",
n.trees = 1000,
interaction.depth = 3,
n.minobsinnode = 10,
shrinkage = 0.01,
cv.folds = 5,
verbose = TRUE)
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.4860 nan 0.0100 0.0031
## 2 1.4799 nan 0.0100 0.0042
## 3 1.4751 nan 0.0100 0.0037
## 4 1.4720 nan 0.0100 0.0007
## 5 1.4668 nan 0.0100 0.0043
## 6 1.4622 nan 0.0100 0.0017
## 7 1.4565 nan 0.0100 0.0049
## 8 1.4519 nan 0.0100 0.0038
## 9 1.4470 nan 0.0100 0.0034
## 10 1.4429 nan 0.0100 0.0026
## 20 1.3973 nan 0.0100 0.0033
## 40 1.3199 nan 0.0100 0.0026
## 60 1.2519 nan 0.0100 0.0015
## 80 1.1922 nan 0.0100 0.0020
## 100 1.1425 nan 0.0100 0.0008
## 120 1.0943 nan 0.0100 0.0013
## 140 1.0515 nan 0.0100 0.0003
## 160 1.0125 nan 0.0100 0.0003
## 180 0.9736 nan 0.0100 0.0003
## 200 0.9397 nan 0.0100 0.0008
## 220 0.9070 nan 0.0100 0.0012
## 240 0.8800 nan 0.0100 0.0009
## 260 0.8535 nan 0.0100 -0.0001
## 280 0.8275 nan 0.0100 0.0005
## 300 0.8029 nan 0.0100 0.0002
## 320 0.7809 nan 0.0100 0.0003
## 340 0.7594 nan 0.0100 0.0003
## 360 0.7400 nan 0.0100 0.0000
## 380 0.7220 nan 0.0100 0.0000
## 400 0.7042 nan 0.0100 0.0003
## 420 0.6874 nan 0.0100 0.0002
## 440 0.6711 nan 0.0100 0.0003
## 460 0.6552 nan 0.0100 0.0000
## 480 0.6392 nan 0.0100 -0.0001
## 500 0.6237 nan 0.0100 -0.0002
## 520 0.6107 nan 0.0100 0.0000
## 540 0.5974 nan 0.0100 0.0000
## 560 0.5852 nan 0.0100 -0.0002
## 580 0.5728 nan 0.0100 0.0001
## 600 0.5612 nan 0.0100 -0.0001
## 620 0.5500 nan 0.0100 -0.0004
## 640 0.5386 nan 0.0100 0.0002
## 660 0.5285 nan 0.0100 0.0000
## 680 0.5187 nan 0.0100 -0.0001
## 700 0.5086 nan 0.0100 -0.0001
## 720 0.4995 nan 0.0100 -0.0001
## 740 0.4900 nan 0.0100 0.0001
## 760 0.4813 nan 0.0100 -0.0001
## 780 0.4729 nan 0.0100 -0.0000
## 800 0.4645 nan 0.0100 -0.0001
## 820 0.4565 nan 0.0100 -0.0002
## 840 0.4491 nan 0.0100 -0.0001
## 860 0.4409 nan 0.0100 0.0000
## 880 0.4336 nan 0.0100 -0.0002
## 900 0.4267 nan 0.0100 -0.0001
## 920 0.4197 nan 0.0100 -0.0001
## 940 0.4133 nan 0.0100 -0.0001
## 960 0.4067 nan 0.0100 -0.0000
## 980 0.4007 nan 0.0100 -0.0002
## 1000 0.3943 nan 0.0100 -0.0001
summary(gbm_model_full)
# finding the best number of trees based on cross-validation
best_trees_full <- gbm.perf(gbm_model_full, method = "cv")
predictions_gbm_full <- predict(gbm_model_full, test_data, n.trees = best_trees_full)
mse_gbm_full <- mean((y_full_test - predictions_gbm_full)^2)
rmse_gbm_full <- sqrt(mse_gbm_full)
cat("Diet + Chemical + Metabolomic + Covariates GBM RMSE:", rmse_gbm_full, "\n")
## Diet + Chemical + Metabolomic + Covariates GBM RMSE: 0.9537
gbm_model_full <- train(
x_full_train, y_full_train,
method = "gbm",
trControl = train_control,
tuneLength = 10,
verbose = FALSE
)
gbm_predictions_cv <- predict(gbm_model_full, x_full_test)
mse_gbm_cv <- mean((y_full_test - gbm_predictions_cv)^2)
rmse_gbm_cv <- sqrt(mse_gbm_cv)
cat("10-Fold CV GBM RMSE:", rmse_gbm_cv, "\n")
## 10-Fold CV GBM RMSE: 0.9526345
The GBM model shows that the most influential variable is polychlorinated biphenyl-170 (hs_pcb170_cadj_Log2), followed by metab_95, metab_161, metab_8, and metab_49. The GBM model’s RMSE is 0.954, indicating good predictive performance. The cross-validated RMSE is 0.953, showing consistent performance across different folds. The variables polychlorinated biphenyl-170 and metab_95 have the highest relative influence, suggesting significant roles in predicting the outcome.
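The relative influence reported by `summary()` on a gbm fit is each variable's raw influence rescaled to sum to 100. A sketch of that normalization with illustrative raw values (not the fitted influences):

```r
# gbm's summary() scales raw variable influence to percentages summing to 100;
# the normalization itself is just this rescaling.
rel_influence <- function(raw) 100 * raw / sum(raw)

# Illustrative raw influences only, not values from the fitted model.
raw <- c(hs_pcb170_cadj_Log2 = 40, metab_95 = 30, metab_161 = 20, metab_8 = 10)
rel_influence(raw)
```

Reading the percentages this way makes influence comparable across models with different numbers of trees.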
In the model comparison for the diet, chemical, metabolomic, and covariates dataset, the group LASSO model achieved the lowest cross-validated RMSE (0.8748), with the group elastic net (0.8845) and group ridge (0.8850) close behind; among the tree-based methods, the GBM performed best (RMSE 0.9526), followed by the random forest (1.0049). Key variables that consistently appeared across the models as important predictors include specific metabolites (metab_161, metab_95, metab_160, metab_8) and chemical exposure variables (hs_pcb170_cadj_Log2, hs_pbde153_cadj_Log2). These variables highlight the critical influence of specific PCB levels and metabolites in predicting health outcomes.
Throughout the analysis, specific variables consistently emerged as significant predictors of BMI Z-scores in children. Variables such as polychlorinated biphenyl-170 (hs_pcb170_cadj_Log2), polybrominated diphenyl ether-153 (hs_pbde153_cadj_Log2), and various metabolites (metab_161, metab_95, metab_160) were highlighted across multiple models, indicating their strong influence on the outcome. Notably, higher levels of copper (hs_cu_c_Log2) and certain metabolites (metab_161, metab_95) were consistently associated with an increased BMI Z-score, suggesting their role in contributing to higher BMI. Conversely, variables such as cadmium (hs_cd_c_Log2) and specific metabolites (metab_160, metab_89) showed negative associations with BMI Z-score, indicating a potential protective effect against increased BMI. These findings imply that both environmental chemical exposures and specific metabolic profiles play crucial roles in childhood BMI outcomes, emphasizing the need for comprehensive interventions targeting these factors to mitigate obesity risk and promote healthier growth trajectories.
The LASSO, ridge, and elastic net models consistently performed well across the various datasets, achieving the lowest RMSEs, particularly when diet, chemical, metabolomic, and covariate data were all included. Incorporating metabolomic data markedly enhanced model performance, underscoring its predictive power for BMI. Decision trees showed moderate performance, with pruning improving RMSE, but remained less effective than the regularized regression models. Random forest and GBM performed robustly, with GBM marginally surpassing random forest. Integrating the diverse data types (diet, chemical, metabolomic) consistently improved RMSE, demonstrating the advantage of a comprehensive data approach.
Among the models tested, the group LASSO model achieved the best overall performance with an RMSE of 0.8748, while the GBM was the strongest of the tree-based methods with an RMSE of 0.9537, both demonstrating good predictive accuracy and generalization.
The results imply that certain environmental chemicals and metabolites play crucial roles in influencing BMI Z-scores in children. The high predictive accuracy of models like GBM and group LASSO underscores the importance of these variables. Higher levels of certain metabolites and chemical exposures were associated with significant changes in BMI Z-scores, suggesting potential areas for intervention to manage or mitigate obesity in children. The findings indicate that environmental and dietary factors, alongside inherent covariates like age, have substantial impacts on children’s health outcomes, specifically BMI.
One of the limitations of this study is the exclusion of 103 rows with missing values from the combined dataset, which represents approximately 8% of the total data. This decision was made to ensure the integrity and robustness of our analysis, as many statistical methods and machine learning algorithms require complete data. However, this exclusion may have introduced some issues. The excluded rows might have specific characteristics that differ from the rest of the dataset. As a result, the cleaned dataset may not fully capture the diversity and variability of the original population, potentially limiting the representativeness of the study’s findings. In future studies, alternative methods such as imputation techniques could be explored to handle missing data, potentially preserving more information and reducing the risk of bias.
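As a concrete baseline for the imputation alternative mentioned here, a minimal median-imputation sketch (toy data; a principled approach such as multiple imputation with the mice package would generally be preferable):

```r
# Replace NAs in numeric columns with the column median, as a simple
# alternative to dropping incomplete rows outright.
impute_median <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
    }
  }
  df
}

# Toy data frame with missing values.
toy <- data.frame(a = c(1, NA, 3), b = c(NA, 2, 4))
impute_median(toy)
```

Median imputation preserves all 1301 pairs but shrinks variance and ignores uncertainty, which is why multiple imputation would be the stronger choice in a follow-up analysis.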
Covariates such as child sex and year of birth were included because they were considered potentially relevant. However, their direct relevance to the specific outcome studied may be limited, which could introduce noise or confounding. Other relevant variables may also have been overlooked; future studies could therefore examine the impact of additional covariates and confounders on the outcomes of interest.
categorical_diet <- dietary_exposome %>%
dplyr::select(where(is.factor))
categorical_diet_long <- pivot_longer(
categorical_diet,
cols = everything(),
names_to = "variable",
values_to = "value"
)
unique_categorical_vars <- unique(categorical_diet_long$variable)
categorical_plots <- lapply(unique_categorical_vars, function(var) {
data <- filter(categorical_diet_long, variable == var)
p <- ggplot(data, aes(x = value, fill = value)) +
geom_bar(stat = "count") +
labs(title = paste("Distribution of", var), x = var, y = "Count")
print(p)
return(p)
})
Breastfeeding Duration: Majority of observations are in the highest duration category, suggesting longer breastfeeding periods are common.
Bakery Products: Shows a relatively even distribution across the three categories, indicating varied consumption levels of bakery products among participants.
Breakfast Cereal: The highest category of cereal consumption is the most common, suggesting a preference for or greater consumption of cereals.
Dairy: Shows a fairly even distribution across all categories, indicating a uniform consumption pattern of dairy products.
Fast Food: Most participants fall into the middle category, indicating moderate consumption of fast food.
Organic Food: Most participants either consume a lot of or no organic food, with fewer in the middle range.
Processed Meat: Consumption levels are fairly evenly distributed, indicating varied dietary habits regarding processed meats.
Bread: Distribution shows a significant leaning towards higher bread consumption.
Cereal: Even distribution across categories suggests varied cereal consumption habits.
Fish and Seafood: Even distribution across categories, indicating varied consumption of fish and seafood.
Fruits: High fruit consumption is the most common, with fewer participants in the lowest category.
Added Fats: More participants consume added fats at the lowest and highest levels, with fewer in the middle.
Sweets: High consumption of sweets is the most common, indicating a preference for or higher access to sugary foods.
Vegetables: Most participants consume a high amount of vegetables.
#separate numeric and categorical data
numeric_chemical <- chemical_exposome %>%
dplyr::select(where(is.numeric))
numeric_chemical_long <- pivot_longer(
numeric_chemical,
cols = everything(),
names_to = "variable",
values_to = "value"
)
unique_numerical_vars <- unique(numeric_chemical_long$variable)
num_plots <- lapply(unique_numerical_vars, function(var) {
data <- filter(numeric_chemical_long, variable == var)
p <- ggplot(data, aes(x = value)) +
geom_histogram(bins = 30, fill = "blue") +
labs(title = paste("Histogram of", var), x = "Value", y = "Count")
print(p)
return(p)
})
Cadmium (hs_cd_c_Log2): The histogram for cadmium exposure shows a relatively symmetric distribution centered around 4 on the log2 scale. Most values range from approximately 3 to 5, with few outliers.
Cobalt (hs_co_c_Log2): The histogram of cobalt levels displays a roughly normal distribution with a slight positive skew, peaking around 3.5.
Cesium (hs_cs_c_Log2): Exhibits a right-skewed distribution, with the majority of the data centered between 1.5 and 2.5, indicating that most participants have relatively low exposure levels while a small number have substantially higher exposures.
Copper (hs_cu_c_Log2): Shows a right-skewed distribution, suggesting that while most individuals have moderate exposure, a few experience significantly higher levels of copper.
Mercury (hs_hg_c_Log2): This distribution is also right-skewed, common for environmental pollutants, where a majority have lower exposure levels, and a minority have high exposure levels.
Molybdenum (hs_mo_c_Log2): Shows a distribution with a sharp peak and a long right tail, suggesting that while most people have similar exposure levels, a few have exceptionally high exposures. Has a sharp peak around 6, indicating that most values fall within a narrow range.
Lead (hs_pb_c_Log2): The distribution is slightly right-skewed, indicating higher exposure levels in a smaller group of the population compared to the majority.
DDE (hs_dde_cadj_Log2): Shows a pronounced right skew, typical for chemicals that accumulate in the environment and in human tissues, indicating higher exposure in a smaller subset of the population.
PCB 153 (hs_pcb153_cadj_Log2): Shows a right-skewed, bimodal distribution with peaks around 2 and 4, suggesting that exposure is higher among a smaller segment of the population.
PCB 170 (hs_pcb170_cadj_Log2): This histogram shows a significant right skew: most samples have low concentrations, with fewer samples showing considerably higher levels.
DEP (hs_dep_cadj_Log2): DEP exposure has a sharp peak around 6, indicating a narrow distribution of values.
PBDE 153 (hs_pbde153_cadj_Log2): This histogram shows a bimodal distribution, with peaks around 1 and 4.
PFHxS: Distribution peaks around 5 with a broad spread, indicating variation in PFHxS levels.
PFOA: Shows a peak around 6 with a symmetrical distribution, indicating consistent PFOA levels.
PFOS: Similar to PFOA, the distribution peaks around 6, indicating consistent PFOS levels.
Propyl Paraben (hs_prpa_cadj_Log2): The distribution peaks around 6 with a broad spread, indicating variability in propyl paraben levels.
Monobenzyl Phthalate (hs_mbzp_cadj_Log2): Shows a broad, right-skewed distribution peaking around 4. Most values cluster at the lower end, indicating a common lower exposure level among subjects, with a long tail towards higher values suggesting occasional higher exposures.
Monoisobutyl Phthalate (hs_mibp_cadj_Log2): The distribution is right-skewed, similar to MBZP, but with a smoother decline. This pattern also indicates that while most subjects have lower exposure levels, a few experience significantly higher exposures.
Mono-n-butyl Phthalate (hs_mnbp_cadj_Log2): Peaks around 4, indicating consistent exposure levels with some variation. A few outliers are present.
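The exposure histograms described above can be produced in bulk with a faceted plot; a minimal sketch, assuming the log2-transformed exposure columns live in a data frame named `chemical_data` (a hypothetical name, analogous to `covariate_data` used below):

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# reshape the log2-transformed exposure columns to long format
# (`chemical_data` is an assumed object name)
chemical_long <- chemical_data %>%
  dplyr::select(where(is.numeric)) %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value")

# one histogram panel per chemical, with free axes since the scales differ
ggplot(chemical_long, aes(x = value)) +
  geom_histogram(bins = 30, fill = "blue") +
  facet_wrap(~ variable, scales = "free") +
  labs(x = "Log2-transformed concentration", y = "Count")
```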
# separate numeric and categorical data
numeric_covariates <- covariate_data %>%
  dplyr::select(where(is.numeric))

# reshape to long format so each variable can be plotted in turn
numeric_covariates_long <- pivot_longer(
  numeric_covariates,
  cols = everything(),
  names_to = "variable",
  values_to = "value"
)

# one histogram per numeric covariate
unique_numerical_vars <- unique(numeric_covariates_long$variable)
num_plots <- lapply(unique_numerical_vars, function(var) {
  data <- filter(numeric_covariates_long, variable == var)
  p <- ggplot(data, aes(x = value)) +
    geom_histogram(bins = 30, fill = "blue") +
    labs(title = paste("Histogram of", var), x = "Value", y = "Count")
  print(p)
  return(p)
})
Child’s Age (hs_child_age): This histogram is multimodal, reflecting several peaks across different ages. This could be indicative of the data collection points or particular age groups being studied.
categorical_covariates <- covariate_data %>%
  dplyr::select(where(is.factor))

categorical_covariates_long <- pivot_longer(
  categorical_covariates,
  cols = everything(),
  names_to = "variable",
  values_to = "value",
  # factors with differing levels cannot be combined into one column directly
  values_transform = as.character
)

# one bar chart per categorical covariate
unique_categorical_vars <- unique(categorical_covariates_long$variable)
categorical_plots <- lapply(unique_categorical_vars, function(var) {
  data <- filter(categorical_covariates_long, variable == var)
  p <- ggplot(data, aes(x = value, fill = value)) +
    geom_bar(stat = "count") +
    labs(title = paste("Distribution of", var), x = var, y = "Count")
  print(p)
  return(p)
})
Gender Distribution (e3_sex): The gender distribution is nearly balanced, with a slightly higher count of males than females.
Year of Birth (e3_yearbir): The majority of subjects were born in the later years, with a marked increase in 2009, perhaps reflecting larger recruitment or a specific cohort focus that year.
# summary of the model to extract feature importance
importance_full <- summary(gbm_model_full, n.trees = best_trees)

# data frame for the top 5 features
top5_importance_full <- importance_full[1:5, ]

ggplot(top5_importance_full, aes(x = reorder(var, rel.inf), y = rel.inf)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 5 Most Important Features in GBM Model",
       x = "Feature",
       y = "Relative Importance")
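The `best_trees` value passed to `summary()` above would typically come from the gbm package's cross-validation utility; a minimal sketch, assuming `gbm_model_full` was fit with `cv.folds > 1`:

```r
library(gbm)

# choose the number of boosting iterations that minimizes the
# cross-validated error; gbm.perf() also plots the error curves
best_trees <- gbm.perf(gbm_model_full, method = "cv")
```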
# combine cross-validated RMSE values for all models and feature sets
results <- data.frame(
  Model = rep(c("Lasso", "Ridge", "Elastic Net", "Decision Tree", "Random Forest", "GBM"), each = 5),
  Data_Set = rep(c("Covariates", "Diet + Covariates", "Chemicals + Covariates", "Diet + Chemical + Covariates", "Diet + Chemical + Metabolomic + Covariates"), 6),
  RMSE = c(
    1.152017, 1.139468, 1.040727, 1.031241, 0.8748226, # Lasso
    1.149128, 1.129036, 1.035411, 1.03338,  0.8849928, # Ridge
    1.152017, 1.138767, 1.035456, 1.032168, 0.8844928, # Elastic Net
    1.155427, 1.152094, 1.089746, 1.089746, 1.127666,  # Decision Tree
    1.154789, 1.129478, 1.031669, 1.025544, 1.004854,  # Random Forest
    1.149695, 1.136542, 1.040439, 1.048183, 0.9526345  # GBM
  )
)

results$Data_Set <- factor(results$Data_Set, levels = c("Covariates", "Diet + Covariates", "Chemicals + Covariates", "Diet + Chemical + Covariates", "Diet + Chemical + Metabolomic + Covariates"))

rmse_plot <- ggplot(results, aes(x = Data_Set, y = RMSE, fill = Model)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
rmse_plot
# filter results for the Lasso model
lasso_results <- subset(results, Model == "Lasso")
rmse_lasso_plot <- ggplot(lasso_results, aes(x = Data_Set, y = RMSE, fill = Data_Set)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Lasso Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
rmse_lasso_plot

# filter results for the ridge model
ridge_results <- subset(results, Model == "Ridge")
rmse_ridge_plot <- ggplot(ridge_results, aes(x = Data_Set, y = RMSE, fill = Data_Set)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Ridge Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
rmse_ridge_plot

# filter results for the elastic net model
enet_results <- subset(results, Model == "Elastic Net")
rmse_enet_plot <- ggplot(enet_results, aes(x = Data_Set, y = RMSE, fill = Data_Set)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Elastic Net Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
rmse_enet_plot
# results for the decision tree model
tree_results <- data.frame(
  Data_Set = c("Covariates", "Diet + Covariates", "Chemicals + Covariates", "Diet + Chemical + Covariates", "Diet + Chemical + Metabolomic + Covariates"),
  RMSE = c(1.155427, 1.152094, 1.089746, 1.089746, 1.127666)
)
tree_results$Data_Set <- factor(tree_results$Data_Set, levels = c("Covariates", "Diet + Covariates", "Chemicals + Covariates", "Diet + Chemical + Covariates", "Diet + Chemical + Metabolomic + Covariates"))

rmse_tree_plot <- ggplot(tree_results, aes(x = Data_Set, y = RMSE, fill = Data_Set)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Decision Tree Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
rmse_tree_plot
# results for the random forest model
rf_results <- data.frame(
  Data_Set = c("Covariates", "Diet + Covariates", "Chemicals + Covariates", "Diet + Chemical + Covariates", "Diet + Chemical + Metabolomic + Covariates"),
  RMSE = c(1.154789, 1.129478, 1.031669, 1.025544, 1.004854)
)
rf_results$Data_Set <- factor(rf_results$Data_Set, levels = c("Covariates", "Diet + Covariates", "Chemicals + Covariates", "Diet + Chemical + Covariates", "Diet + Chemical + Metabolomic + Covariates"))

rmse_rf_plot <- ggplot(rf_results, aes(x = Data_Set, y = RMSE, fill = Data_Set)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Random Forest Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
rmse_rf_plot
# results for the GBM model
gbm_results <- data.frame(
  Data_Set = c("Covariates", "Diet + Covariates", "Chemicals + Covariates", "Diet + Chemical + Covariates", "Diet + Chemical + Metabolomic + Covariates"),
  RMSE = c(1.149695, 1.136542, 1.040439, 1.048183, 0.9526345)
)
gbm_results$Data_Set <- factor(gbm_results$Data_Set, levels = c("Covariates", "Diet + Covariates", "Chemicals + Covariates", "Diet + Chemical + Covariates", "Diet + Chemical + Metabolomic + Covariates"))

rmse_gbm_plot <- ggplot(gbm_results, aes(x = Data_Set, y = RMSE, fill = Data_Set)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Gradient Boosting Machine Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
rmse_gbm_plot
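The six per-model blocks above repeat the same plotting code; an equivalent single-pass alternative is to facet the combined `results` data frame on `Model`:

```r
# one RMSE panel per model, drawn from the combined results table
ggplot(results, aes(x = Data_Set, y = RMSE, fill = Data_Set)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ Model) +
  labs(title = "RMSE by Model and Data Set",
       x = "Data Set", y = "Root Mean Squared Error") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```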